Abstract:

We propose an algorithm to estimate the common density $s$ of a stationary
process $X_1, \dots, X_n$. We suppose that the process is either $\beta$- or $\tau$-mixing. We
provide a model selection procedure based on a generalization of Mallows' $C_p$
and we prove oracle inequalities for the selected estimator under a few prior
assumptions on the collection of models and on the mixing coefficients. We
prove that our estimator is adaptive over a class of Besov spaces, namely, we
prove that it achieves the same rates of convergence as in the i.i.d. framework.

1 Introduction

We consider the problem of estimating the unknown density $s$ of $P$, the law of a random
variable $X$, based on the observation of $n$ (possibly) dependent data $X_1, \dots, X_n$ with
common law $P$. We assume that $X$ is real valued, that $s$ belongs to $L^2(\mu)$ where $\mu$ denotes the
Lebesgue measure on $\mathbb{R}$ and that $s$ is compactly supported, say in $[0, 1]$. Throughout the
chapter, we consider least-squares estimators $\hat s_m$ of $s$ on a collection $(S_m)_{m \in \mathcal{M}_n}$ of linear
subspaces of $L^2(\mu)$. Our final estimator is chosen through a model selection algorithm.
Model selection has received much interest in the last decades. When its final goal is
prediction, it can be seen more generally as the question of choosing between the outcomes of
several prediction algorithms. With such a general formulation, a very natural answer is
the following. First, estimate the prediction error for each model, that is $\|s - \hat s_m\|_2^2$. Then,
select the model which minimizes this estimate.

It is natural to think of the empirical risk as an estimator of the prediction error. This can 
fail dramatically, because it uses the same data for building predictors and for comparing 
them, making these estimates strongly biased for models involving a number of parameters 
growing with the sample size. 

In order to correct this drawback, penalization methods state that a good choice can be
made by minimizing the sum of the empirical risk (how well the algorithms fit the data) and
some complexity measure of the algorithms (called the penalty). This method was first
developed in the work of Akaike [2] and [1] and Mallows [19].

In the context of density estimation, with independent data, Birgé & Massart [8] used
penalties of order $L_n D_m/n$, where $D_m$ denotes the dimension of $S_m$ and $L_n$ is a constant
depending on the complexity of the collection $\mathcal{M}_n$. They used Talagrand's inequality (see
for example Talagrand [24] for an overview) to prove that this penalization procedure is
efficient, i.e. the integrated quadratic risk of the selected estimator is asymptotically
equivalent to the risk of the oracle (see Section 2 for a precise definition). They also proved that
the selected estimator achieves adaptive rates of convergence over a large class of Besov
spaces. Moreover, they showed that some methods of adaptive density estimation like the
unbiased cross validation (Rudemo [23]) or the hard thresholded estimator of Donoho et
al. [3] can be viewed as special instances of penalized projection estimators.
More recently, Arlot introduced new measures of the quality of penalized least-squares
estimators (PLSE). He proved pathwise oracle inequalities, that is deviation bounds for
the PLSE that are harder to prove but more informative from a practical point of view
(see also Section 2 for details).

When the process $(X_j)_{j=1,\dots,n}$ is $\beta$-mixing (Rozanov & Volkonskii [26] and Section 2),
Talagrand's inequality cannot be used directly. Baraud et al. [6] used Berbee's coupling
lemma (see Berbee [3]) and Viennet's covariance inequality (Viennet [25]) to overcome
this problem and build a model selection procedure in the regression problem. Then Comte
& Merlevède [13] used this algorithm to investigate the problem of density estimation for
a $\beta$-mixing process. They proved that under reasonable assumptions on the collection $\mathcal{M}_n$
and on the coefficients $\beta$, one can recover the results of Birgé & Massart [8] in the i.i.d.
framework.

The main drawback of those results is that many processes, even simple Markov chains,
are not $\beta$-mixing. For instance, if $(\epsilon_j)_{j \ge 1}$ is i.i.d. with marginal $\mathcal{B}(1/2)$, then the stationary
solution $(X_j)_{j \ge 0}$ of the equation

$X_n = \frac{1}{2}(X_{n-1} + \epsilon_n)$, $X_0$ independent of $(\epsilon_j)_{j \ge 1}$    (1)

is not $\beta$-mixing (Andrews). More recently, Dedecker & Prieur [15] introduced new
mixing coefficients, in particular the coefficients $\tau$, $\tilde\phi$ and $\tilde\beta$, and proved that many processes
like (1) happen to be $\tau$-, $\tilde\phi$- and $\tilde\beta$-mixing. They proved a coupling lemma for the coefficient
$\tau$ and covariance inequalities for $\tilde\phi$ and $\tilde\beta$. Gannaz & Wintenberger [18] used the covariance
inequality to extend the result of Donoho et al. [3] for the wavelet thresholded estimator
to the case of $\tilde\phi$-mixing processes. They recovered (up to a $\log(n)$ factor) the adaptive
rates of convergence over Besov spaces.

In this article, we first investigate the case of $\beta$-mixing processes. We prove a pathwise
oracle inequality for the PLSE. We extend the result of Comte & Merlevède [13] under
weaker assumptions on the mixing coefficients. Then, we consider $\tau$-mixing processes. The
problem is that the coupling result is weaker for the coefficient $\tau$ than for $\beta$. Moreover,
in order to control the empirical process we use a covariance inequality that is harder to
handle. Hence, the generalization of the procedure of Baraud et al. [6] to the framework
of $\tau$-mixing processes is not straightforward. We recover the optimal adaptive rates of
convergence over Besov spaces (that is the same as in the independent framework) for
$\tau$-mixing processes, which is new as far as we know.

The chapter is organized as follows. In Section 2, we give the basic material that we will 
use throughout the chapter. We recall the definition of some mixing coefficients and we 
state their properties. We define the penalized least-squares estimator (PLSE). Sections 3 
and 4 are devoted to the statement of the main results, respectively in the $\beta$-mixing case
and in the $\tau$-mixing case. In Section 5, we derive the adaptive properties of the PLSE.
Finally, Section 6 is devoted to the proofs. Some additional material has been reported in
the Appendix in Section 7.

2 Preliminaries



2.1 Notation. 

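Let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space. Let $\mu$ be the Lebesgue measure on $\mathbb{R}$, let $\|\cdot\|_p$ be the
usual norm on $L^p(\mu)$ for $1 \le p \le \infty$. For all $y \in \mathbb{R}^l$, let $|y|_l = \sum_{i=1}^{l} |y_i|$. Denote by $\lambda_\kappa$ the
set of $\kappa$-Lipschitz functions, i.e. the functions $t$ from $(\mathbb{R}^l, |\cdot|_l)$ to $\mathbb{R}$ such that $\mathrm{Lip}(t) \le \kappa$
where

$\mathrm{Lip}(t) = \sup\left\{ \frac{|t(x) - t(y)|}{|x - y|_l},\ x, y \in \mathbb{R}^l,\ x \neq y \right\}$.

Let $BV$ and $BV_1$ be the sets of functions $t$ supported on $\mathbb{R}$ satisfying respectively $\|t\|_{BV} < \infty$
and $\|t\|_{BV} \le 1$ where

$\|t\|_{BV} = \sup_{n \in \mathbb{N}^*}\ \sup_{-\infty < a_1 < \dots < a_n < \infty}\ \sum_{i=1}^{n-1} |t(a_{i+1}) - t(a_i)|$.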

2.2 Some measures of dependence. 
2.2.1 Definitions and assumptions 

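Let $Y = (Y_1, \dots, Y_l)$ be a random variable defined on $(\Omega, \mathcal{A}, \mathbb{P})$ with values in $(\mathbb{R}^l, |\cdot|_l)$. Let
$\mathcal{M}$ be a $\sigma$-algebra of $\mathcal{A}$. Let $P_{Y|\mathcal{M}}$, $P_{Y_1|\mathcal{M}}$ be conditional distributions of $Y$ and $Y_1$ given
$\mathcal{M}$, let $P_Y$, $P_{Y_1}$ be the distributions of $Y$ and $Y_1$ and let $F_{Y_1|\mathcal{M}}$, $F_{Y_1}$ be the distribution functions
of $P_{Y_1|\mathcal{M}}$ and $P_{Y_1}$. Let $\mathcal{B}$ be the Borel $\sigma$-algebra on $(\mathbb{R}^l, |\cdot|_l)$. Define now

$\beta(\mathcal{M}, \sigma(Y)) = \mathbb{E}\left( \sup_{A \in \mathcal{B}} |P_{Y|\mathcal{M}}(A) - P_Y(A)| \right)$,

$\tilde\beta(\mathcal{M}, Y_1) = \mathbb{E}\left( \sup_{x \in \mathbb{R}} |F_{Y_1|\mathcal{M}}(x) - F_{Y_1}(x)| \right)$,

and if $\mathbb{E}(|Y|_l) < \infty$, $\tau(\mathcal{M}, Y) = \mathbb{E}\left( \sup_{t \in \lambda_1} |P_{Y|\mathcal{M}}(t) - P_Y(t)| \right)$.

The coefficient $\beta(\mathcal{M}, \sigma(Y))$ is the mixing coefficient introduced by Rozanov & Volkonskii
[26]. The coefficients $\tilde\beta(\mathcal{M}, Y_1)$ and $\tau(\mathcal{M}, Y)$ have been introduced by Dedecker & Prieur
[15].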

Let $(X_k)_{k \in \mathbb{Z}}$ be a stationary sequence of real valued random variables defined on $(\Omega, \mathcal{A}, \mathbb{P})$.
For all $k \in \mathbb{N}^*$, the coefficients $\beta_k$ and $\tilde\beta_k$ are defined by

$\beta_k = \beta(\sigma(X_i, i \le 0), \sigma(X_i, i \ge k))$, $\tilde\beta_k = \sup_{j \ge k} \{\tilde\beta(\sigma(X_p, p \le 0), X_j)\}$.

If $\mathbb{E}(|X_1|) < \infty$, for all $k \in \mathbb{N}^*$ and all $r \in \mathbb{N}^*$, let

$\tau_{k,r} = \max_{1 \le l \le r} \frac{1}{l} \sup_{k \le i_1 < \dots < i_l} \{\tau(\sigma(X_p, p \le 0), (X_{i_1}, \dots, X_{i_l}))\}$, $\tau_k = \sup_{r \in \mathbb{N}^*} \tau_{k,r}$.

Moreover, we set $\beta_0 = 1$. In the sequel, the processes of interest are either $\beta$-mixing or
$\tau$-mixing, meaning that, for $\gamma = \beta$ or $\tau$, the $\gamma$-mixing coefficients $\gamma_k \to 0$ as $k \to +\infty$. For
$p \in \{1, 2\}$, we define $\kappa_p$ as:

$\kappa_p = p \sum_{l=0}^{\infty} l^{p-1} \beta_l$    (2)

where $0^0 = 1$, when the series are convergent. Besides, we consider two kinds of rates of
convergence to 0 of the mixing coefficients, that is for $\gamma = \beta$ or $\tau$,

[AR] arithmetical $\gamma$-mixing with rate $\theta$ if there exists some $\theta > 0$ such that $\gamma_k \le (1 + k)^{-(1+\theta)}$
for all $k$ in $\mathbb{N}$,

[GEO] geometrical $\gamma$-mixing with rate $\theta$ if there exists some $\theta > 0$ such that $\gamma_k \le e^{-\theta k}$
for all $k$ in $\mathbb{N}$.
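
To make the two rate conditions concrete, the following minimal Python sketch evaluates truncated versions of the series defining $\kappa_p$ in (2) when the coefficients are replaced by an [AR] or a [GEO] upper bound. The function names, the truncation level `n_terms` and the substitution of the upper bounds for the true (unknown) mixing coefficients are assumptions made for illustration only.

```python
import math

def arithmetic_rate(theta):
    """Upper bound (1 + k)^(-(1 + theta)) from [AR]; equals 1 at k = 0."""
    return lambda k: (1.0 + k) ** (-(1.0 + theta))

def geometric_rate(theta):
    """Upper bound exp(-theta * k) from [GEO]; equals 1 at k = 0."""
    return lambda k: math.exp(-theta * k)

def kappa(p, coeff, n_terms=100_000):
    """Truncated version of kappa_p = p * sum_{l >= 0} l^(p-1) * coeff(l).

    Python evaluates 0.0 ** 0 as 1.0, which matches the convention 0^0 = 1.
    """
    total = 0.0
    for l in range(n_terms):
        total += float(l) ** (p - 1) * coeff(l)
    return p * total
```

With the [AR] bound and $\theta = 1$, `kappa(1, arithmetic_rate(1.0))` approximates $\sum_{l \ge 0} (1+l)^{-2} = \pi^2/6$. With the same bound the series for $p = 2$ converges only when $\theta > 1$, which is in line with the conditions $\theta > 2$ and $\theta > 5$ used in the theorems.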



2.2.2 Properties 
Coupling 

Let $X$ be an $\mathbb{R}^l$-valued random variable defined on $(\Omega, \mathcal{A}, \mathbb{P})$ and let $\mathcal{M}$ be a $\sigma$-algebra.
Assume that there exists a random variable $U$ uniformly distributed on $[0, 1]$ and
independent of $\mathcal{M} \vee \sigma(X)$. There exist two $\mathcal{M} \vee \sigma(X) \vee \sigma(U)$-measurable random variables $X_1^*$
and $X_2^*$ distributed as $X$ and independent of $\mathcal{M}$ such that

$\beta(\mathcal{M}, \sigma(X)) = \mathbb{P}(X \neq X_1^*)$ and    (3)

$\tau(\mathcal{M}, X) = \mathbb{E}(|X - X_2^*|_l)$.    (4)

Equality (3) has been established by Berbee [3], Equality (4) has been established in
Dedecker & Prieur [15], Section 7.1.
Covariance inequalities 

Let $X$, $Y$ be two real valued random variables and let $f$, $h$ be two measurable functions
from $\mathbb{R}$ to $\mathbb{C}$. Then, there exist two measurable functions $b_1 : \mathbb{R} \to \mathbb{R}$ and $b_2 : \mathbb{R} \to \mathbb{R}$ with
$\mathbb{E}(b_1(X)) = \mathbb{E}(b_2(Y)) = \beta(\sigma(X), \sigma(Y))$ such that, for any conjugate $p, q \ge 1$ (see Viennet
[25] Lemma 4.1)

$|\mathrm{Cov}(f(X), h(Y))| \le 2\, \mathbb{E}^{1/p}(|f(X)|^p b_1(X))\, \mathbb{E}^{1/q}(|h(Y)|^q b_2(Y))$.

There exists a random variable $b(\sigma(X), Y)$ such that $\mathbb{E}(b(\sigma(X), Y)) = \tilde\beta(\sigma(X), Y)$ and such
that, for all Lipschitz functions $f$ and all $h$ in $BV$ (Dedecker & Prieur [15] Proposition 1)

$|\mathrm{Cov}(f(X), h(Y))| \le \|h\|_{BV}\, \mathbb{E}(|f(X)|\, b(\sigma(X), Y)) \le \|h\|_{BV} \|f\|_\infty\, \tilde\beta(\sigma(X), Y)$.    (5)

Comparison results 

Let $(X_k)_{k \in \mathbb{Z}}$ be a sequence of identically distributed real random variables. If the marginal
distribution satisfies a concentration condition $|F_X(x) - F_X(y)| \le K|x - y|^a$ with $a \le 1$,
$K > 0$, then (Dedecker et al. [14] Remark 5.1 p 104)

$\tilde\beta_k \le 2K^{1/(1+a)} \tau_{k,1}^{a/(a+1)} \le 2K^{1/(1+a)} \tau_k^{a/(a+1)}$.

In particular, if $F_X$ has a density $s$ with respect to the Lebesgue measure $\mu$ and if $s \in L^2(\mu)$,
we have from the Cauchy-Schwarz inequality

$|F_X(x) - F_X(y)| = \left| \int 1_{[x,y]}\, s\, d\mu \right| \le \|s\|_2 \left( \int 1_{[x,y]}\, d\mu \right)^{1/2} = \|s\|_2 |x - y|^{1/2}$,

thus

$\tilde\beta_k \le 2\|s\|_2^{2/3} \tau_k^{1/3}$.

In particular, for any arithmetically [AR] $\tau$-mixing process with rate $\theta > 2$, we have

$\tilde\beta_k \le 2\|s\|_2^{2/3} (1 + k)^{-(1+\theta)/3}$.    (6)

2.2.3 Examples



Examples of $\beta$-mixing and $\tau$-mixing sequences are well known; we refer to the books of
Doukhan [17] and Bradley for examples of $\beta$-mixing processes and to the book of
Dedecker et al. [14] or the articles of Dedecker & Prieur [15], Prieur [21], and Comte
et al. [12] for examples of $\tau$-mixing sequences. One of the most important examples is
the following: a stationary, irreducible, aperiodic and positively recurrent Markov chain
$(X_i)_{i \ge 1}$ is $\beta$-mixing. However, many simple Markov chains are not $\beta$-mixing but are
$\tau$-mixing. For instance, it has been known for a long time that if $(\epsilon_i)_{i \ge 1}$ are i.i.d. Bernoulli $\mathcal{B}(1/2)$,
then a stationary solution $(X_i)_{i \ge 0}$ of the equation

$X_n = \frac{1}{2}(X_{n-1} + \epsilon_n)$, $X_0$ independent of $(\epsilon_i)_{i \ge 1}$

is not $\beta$-mixing since $\beta_k = 1$ for any $k \ge 1$ whereas $\tau_k \le 2^{-k}$ (see Dedecker & Prieur [15]
Section 4.1). Another advantage of the coefficient $\tau$ is that it is easy to compute in many
situations (see Dedecker & Prieur [15] Section 4).
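
The Markov chain of this example is easy to simulate, and a simulation makes the failure of $\beta$-mixing tangible. The sketch below is only illustrative: the function names are hypothetical, and it relies on the standard fact that the uniform law on $[0, 1]$ is stationary for this recursion, so that $X_0$ can be drawn uniformly.

```python
import random

def simulate_chain(n, seed=0):
    """Simulate X_k = (X_{k-1} + eps_k) / 2 with eps_k i.i.d. Bernoulli(1/2).

    X_0 is drawn uniformly on [0, 1], which is the stationary law of the chain.
    Returns the list [X_1, ..., X_n].
    """
    rng = random.Random(seed)
    x = rng.random()
    path = []
    for _ in range(n):
        eps = rng.randint(0, 1)   # Bernoulli(1/2) innovation
        x = 0.5 * (x + eps)
        path.append(x)
    return path

def previous_state(x_next):
    """Recover X_{k-1} from X_k: the past is a deterministic function of the future."""
    return (2.0 * x_next) % 1.0
```

Since `previous_state` recovers $X_{k-1}$ from $X_k$ (up to floating-point rounding), the past is measurable with respect to the future. This is the intuition behind $\beta_k = 1$, while the contraction by $1/2$ at each step is what drives the bound $\tau_k \le 2^{-k}$.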



2.3 Collections of models 

We observe $n$ identically distributed real valued random variables $X_1, \dots, X_n$ with common
density $s$ with respect to the Lebesgue measure $\mu$. We assume that $s$ belongs to the Hilbert
space $L^2(\mu)$ endowed with norm $\|\cdot\|_2$. We consider an orthonormal system $\{\psi_{j,k}\}_{(j,k) \in \Lambda}$
of $L^2(\mu)$ and a collection of models $(S_m)_{m \in \mathcal{M}_n}$ indexed by subsets $m \subset \Lambda$ for which we
assume that the following assumptions are fulfilled:

[M1] for all $m \in \mathcal{M}_n$, $S_m$ is the linear span of $\{\psi_{j,k}\}_{(j,k) \in m}$ with finite dimension $D_m =
|m| \ge 2$ and $N_n = \max_{m \in \mathcal{M}_n} D_m$ satisfies $N_n \le n$;
[M2] there exists a constant $\Phi$ such that

$\forall m, m' \in \mathcal{M}_n$, $\forall t \in S_m$, $\forall t' \in S_{m'}$, $\|t + t'\|_\infty \le \Phi \sqrt{\dim(S_m + S_{m'})}\, \|t + t'\|_2$;

[M3] $D_m \le D_{m'}$ implies that $m \subset m'$ and so $S_m \subset S_{m'}$.
As a consequence of the Cauchy-Schwarz inequality, we have

$\sup_{t \in S_m + S_{m'},\ t \neq 0} \frac{\|t\|_\infty^2}{\|t\|_2^2} = \left\| \sum_{(j,k) \in m \cup m'} \psi_{j,k}^2 \right\|_\infty$,    (7)

see Birgé & Massart [8] p 58. Three examples are usually developed as fulfilling this set of
assumptions:

[T] trigonometric spaces: $\psi_{0,0}(x) = 1$ and for all $j \in \mathbb{N}^*$, $\psi_{j,1}(x) = \cos(2\pi j x)$, $\psi_{j,2}(x) =
\sin(2\pi j x)$. $m = \{(0,0), (j,1), (j',2),\ 1 \le j, j' \le J_m\}$ and $D_m = 2J_m + 1$;

[P] regular piecewise polynomial spaces: $S_m$ is generated by $r$ polynomials $\psi_{j,k}$ of degree
$k = 0, \dots, r-1$ on each subinterval $[(j-1)/J_m, j/J_m]$ for $j = 1, \dots, J_m$, $D_m = rJ_m$,
$\mathcal{M}_n = \{m = \{(j,k),\ j = 1, \dots, J_m,\ k = 0, \dots, r-1\},\ 1 \le J_m \le [n/r]\}$;

[W] spaces generated by dyadic wavelet with regularity $r$ as described in Section 4.

For a precise description of those spaces and their properties, we refer to Birgé & Massart
[8].

2.4 The estimator 

Let $(X_n)_{n \in \mathbb{Z}}$ be a real valued stationary process and let $P$ denote the law of $X_0$. Assume
that $P$ has a density $s$ with respect to the Lebesgue measure $\mu$ and that $s \in L^2(\mu)$.
Let $(S_m)_{m \in \mathcal{M}_n}$ be a collection of models satisfying assumptions [M1]-[M3]. We define
$S_n = \cup_{m \in \mathcal{M}_n} S_m$, $s_m$ and $s_n$ the orthogonal projections of $s$ onto $S_m$ and $S_n$ respectively,
let $\mathbb{P}$ be the joint distribution of the observations $(X_n)_{n \in \mathbb{Z}}$ and let $\mathbb{E}$ be the corresponding
expectation. We define the operators $P_n$, $P$ and $\nu_n$ on $L^2(\mu)$ by

$P_n t = \frac{1}{n} \sum_{i=1}^{n} t(X_i)$, $Pt = \int t(x) s(x)\, d\mu(x)$, $\nu_n(t) = (P_n - P)t$.

All the real numbers that we shall introduce and which are not indexed by $m$ or $n$ are fixed
constants. In order to define the penalized least-squares estimator, let us consider on $\mathbb{R} \times S_n$
the contrast function $\gamma(x, t) = -2t(x) + \|t\|_2^2$ and its empirical version $\gamma_n(t) = P_n \gamma(\cdot, t)$.
Minimizing $\gamma_n(t)$ over $S_m$ leads to the classical projection estimator $\hat s_m$ on $S_m$. Let $\hat s_n$ be
the projection estimator on $S_n$. Since $\{\psi_{j,k}\}_{(j,k) \in m}$ is an orthonormal basis of $S_m$ one gets

$\hat s_m = \sum_{(j,k) \in m} (P_n \psi_{j,k}) \psi_{j,k}$ and $\gamma_n(\hat s_m) = -\sum_{(j,k) \in m} (P_n \psi_{j,k})^2$.

Now, given a penalty function $\mathrm{pen} : \mathcal{M}_n \to \mathbb{R}_+$, we define a selected model $\hat m$ as any
element

$\hat m \in \arg\min_{m \in \mathcal{M}_n} (\gamma_n(\hat s_m) + \mathrm{pen}(m))$    (8)

and a PLSE is defined as any $\tilde s \in S_{\hat m} \subset S_n$ such that

$\gamma_n(\tilde s) + \mathrm{pen}(\hat m) = \inf_{m \in \mathcal{M}_n} (\gamma_n(\hat s_m) + \mathrm{pen}(m))$.    (9)
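
The definitions (8) and (9) can be sketched in a few lines of Python. This is a minimal illustration, not the procedure analysed in the theorems: it assumes the piecewise constant case ($r = 1$) of the spaces [P], so that $D_m = J_m$, and it takes the penalty as `pen_const * D_m / n` with a user-supplied `pen_const` standing in for the constant in front of $D_m/n$. All function names are hypothetical.

```python
import random

def histogram_coefficients(sample, J):
    """Empirical coefficients P_n(psi_j) for psi_j = sqrt(J) * 1_{[j/J, (j+1)/J)}, j = 0..J-1."""
    n = len(sample)
    counts = [0] * J
    for x in sample:
        counts[min(int(x * J), J - 1)] += 1  # x = 1.0 is sent to the last bin
    return [J ** 0.5 * c / n for c in counts]

def empirical_contrast(coeffs):
    """gamma_n(s_hat_m) = - sum_j (P_n psi_j)^2 for an orthonormal basis."""
    return -sum(b * b for b in coeffs)

def select_dimension(sample, max_J, pen_const):
    """Minimise gamma_n(s_hat_m) + pen(m) with pen(m) = pen_const * D_m / n and D_m = J."""
    n = len(sample)
    best_J, best_crit = None, float("inf")
    for J in range(2, max_J + 1):
        crit = empirical_contrast(histogram_coefficients(sample, J)) + pen_const * J / n
        if crit < best_crit:
            best_J, best_crit = J, crit
    return best_J

def plse(sample, max_J, pen_const):
    """Return the selected dimension and the penalized least-squares density estimate."""
    J = select_dimension(sample, max_J, pen_const)
    coeffs = histogram_coefficients(sample, J)

    def density(x):
        return J ** 0.5 * coeffs[min(int(x * J), J - 1)]

    return J, density
```

For data supported in $[0, 1]$, `plse(sample, max_J, pen_const)` returns the selected number of bins together with a piecewise constant density estimate that integrates to one. In practice the constant in the penalty is unknown, which is why the text later points to the slope heuristic for its calibration.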

2.5 Oracle inequalities 

An ideal procedure for estimation chooses an oracle

$m_o \in \arg\min_{m \in \mathcal{M}_n} \{\|s - \hat s_m\|_2^2\}$.

An oracle depends on the unknown $s$ and on the data so that it is unknown in practice.
In order to validate our procedure, we try to prove:
- non asymptotic oracle inequalities for the PLSE:

$\mathbb{E}(\|s - \tilde s\|_2^2) \le L \inf_{m \in \mathcal{M}_n} \{\mathbb{E}(\|s - \hat s_m\|_2^2 + R(m, n))\}$,    (10)

for some constant $L \ge 1$ (as close to 1 as possible) and a remainder term $R(m, n) \ge 0$,
possibly random, and small compared to $\mathbb{E}(\|s - \tilde s\|_2^2)$ if possible. This inequality compares
the risk of the PLSE with the best deterministic choice of $m$. Since $\hat m$ is random, we prefer
to prove a stronger form of oracle inequality:

$\mathbb{E}(\|s - \tilde s\|_2^2) \le L\, \mathbb{E}\left( \inf_{m \in \mathcal{M}_n} \{\|s - \hat s_m\|_2^2 + R(m, n)\} \right)$,    (11)

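or, when it is possible, deviation bounds for the PLSE:

$\mathbb{P}\left( \|s - \tilde s\|_2^2 > L \inf_{m \in \mathcal{M}_n} (\|s - \hat s_m\|_2^2 + R(m, n)) \right) \le c_n$,    (12)

where typically $c_n \le C/n^{1+\gamma}$ for some $\gamma > 0$. Inequality (12) proves that, asymptotically,
the risk $\|s - \tilde s\|_2^2$ is almost surely the one of the oracle. Let

$\Omega = \left\{ \|s - \tilde s\|_2^2 > L \inf_{m \in \mathcal{M}_n} (\|s - \hat s_m\|_2^2 + R(m, n)) \right\}$.

We have

$\mathbb{E}(\|s - \tilde s\|_2^2) = \mathbb{E}(\|s - \tilde s\|_2^2\, 1_\Omega) + \mathbb{E}(\|s - \tilde s\|_2^2\, 1_{\Omega^c})$.

It is clear that $\mathbb{E}(\|s - \tilde s\|_2^2\, 1_{\Omega^c}) \le L\, \mathbb{E}\left( \inf_{m \in \mathcal{M}_n} \{\|s - \hat s_m\|_2^2 + R(m, n)\} \right)$. Moreover, we
have $\|s - \tilde s\|_2^2 = \|s - s_{\hat m}\|_2^2 + \|s_{\hat m} - \tilde s\|_2^2 \le \|s\|_2^2 + \Phi^2 D_{\hat m} \le \|s\|_2^2 + \Phi^2 n$, thus, when (12)
holds, the contribution of the event $\Omega$, whose probability is at most $c_n$, satisfies

$\mathbb{E}(\|s - \tilde s\|_2^2\, 1_\Omega) \le (\|s\|_2^2 + \Phi^2 n)\, c_n \le \frac{C(\|s\|_2^2 + \Phi^2)}{n^\gamma}$.

Therefore, inequality (12) implies

$\mathbb{E}(\|s - \tilde s\|_2^2) \le L\, \mathbb{E}\left( \inf_{m \in \mathcal{M}_n} \{\|s - \hat s_m\|_2^2 + R(m, n)\} \right) + \frac{C(\|s\|_2^2 + \Phi^2)}{n^\gamma}$.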

We can derive from these inequalities adaptive rates of convergence of the PLSE on Besov
spaces (see Birgé & Massart [8] for example). In order to achieve this goal, we only have
to prove a weaker form of oracle inequality where the remainder term $R(m, n) \le L D_m/n$
for some constant $L$, for all the models $m$ with sufficiently large dimension. This will be
detailed in Section 5.

3 Results for $\beta$-mixing processes

From now on, the letters $\kappa$, $L$ and $K$, with various sub- or superscripts, will denote some
constants which may vary from line to line. One shall use $L_\cdot$ to indicate more precisely the
dependence on various quantities, especially those which are related to the unknown $s$.
In this section, we give the following theorem for $\beta$-mixing sequences. It can be seen as a
pathwise version of Theorem 3.1 in Comte & Merlevède [13].

Theorem 3.1 Consider a collection of models satisfying [M1], [M2] and [M3]. Assume
that the process $(X_n)_{n \in \mathbb{Z}}$ is strictly stationary and arithmetically [AR] $\beta$-mixing with
mixing rate $\theta > 2$ and that its marginal distribution admits a density $s$ with respect to the
Lebesgue measure $\mu$, with $s \in L^2(\mu)$.

Let $\kappa_1$ be the constant defined in (2) and let $\tilde s$ be the PLSE defined by (9) with

$\mathrm{pen}(m) = \frac{K \Phi^2 \kappa_1 D_m}{n}$, where $K > 4$.

Then, for all $\kappa > 2$ there exist $c_0 > 0$, $L_s > 0$, $\gamma_1 > 0$ and a sequence $\epsilon_n \to 0$, such that

$\mathbb{P}\left( \|\tilde s - s\|_2^2 > (1 + \epsilon_n) \inf_{m \in \mathcal{M}_n,\ D_m \ge c_0 (\log n)^{\gamma_1}} \left( \|s - s_m\|_2^2 + \mathrm{pen}(m) \right) \right) \le L_s \frac{(\log n)^{(\theta+2)\kappa}}{n^{\theta/2}}$.    (13)



Remark: The term $K\Phi^2\kappa_1$ is the same as in Theorem 3.1 of Comte & Merlevède [13] but
with a constant $K > 4$ instead of 320. The main drawback of this result is that the penalty
term involves the constant $\kappa_1$ which is unknown in practice. However, Theorem 3.1 ensures
that penalties proportional to the linear dimension of $S_m$ lead to efficient model selection
procedures. Thus we can use this information to apply the slope heuristic algorithm
introduced by Birgé & Massart in a Gaussian regression context and generalized by Arlot
& Massart to more general M-estimation frameworks. This algorithm calibrates the
constant in front of the penalty term when the shape of an ideal penalty is available. The
result of Arlot & Massart is proven for independent sequences, in a regression framework,
but it can be generalized to the density estimation framework, for independent as well as
for $\beta$ or $\tau$ dependent data. This result is beyond the scope of this chapter and will be
proved in chapter 4.

We have to consider the infimum in equation (13) over the models with sufficiently large
dimensions. However, as noted by Arlot (Remark 9 p 43), we can take the infimum over
all the models in (13) if we add an extra term in (13). More precisely, we can prove that,
with probability larger than $1 - L_s (\log n)^{(\theta+2)\kappa}/n^{\theta/2}$



where $L > 0$ and $\gamma_2 > 0$.

Remark: The main improvement of Theorem 3.1 is that it gives an oracle inequality in
probability, with a deviation bound of order $o(1/n)$ as soon as $\theta > 2$ instead of $\theta > 3$ in
Comte & Merlevède [13]. Moreover, we do not require $s$ to be bounded to prove our result.
Remark: When the data are independent, the proof of Theorem 3.1 can be used to
obtain that the estimator $\tilde s$ chosen with a penalty term of order $K\Phi D_m/n$ satisfies an oracle
inequality as (13). The main difference would be that $\kappa_1 = 1$, thus it can be used without
a slope heuristic (even if this algorithm can be used also in this context to optimize the
constant $K$) and the control of the probability would be $L_s e^{-(\ln n)^2/C_s}$ for some constants
$L_s$, $C_s$ instead of $L_s (\log n)^{(\theta+2)\kappa} n^{-\theta/2}$ in our theorem.

4 Results for $\tau$-mixing sequences

In order to deal with $\tau$-mixing sequences, we need to specify the basis $(\psi_{j,k})_{(j,k) \in \Lambda}$.
4.1 Wavelet basis

Throughout this section, $r$ is a real number, $r \ge 1$ and we work with an $r$-regular
orthonormal multiresolution analysis of $L^2(\mu)$, associated with a compactly supported
scaling function $\phi$ and a compactly supported mother wavelet $\psi$. Without loss of generality,
we suppose that the support of the functions $\phi$ and $\psi$ is an interval $[A_1, A_2)$ where $A_1$
and $A_2$ are integers such that $A_2 - A_1 = A \ge 1$. Let us recall that $\phi$ and $\psi$ generate an
orthonormal basis by dilatations and translations.

For all $k \in \mathbb{Z}$ and $j \in \mathbb{N}^*$, let $\psi_{0,k} : x \mapsto \sqrt{2}\phi(2x - k)$ and $\psi_{j,k} : x \mapsto 2^{j/2}\psi(2^j x - k)$. The
family $\{(\psi_{j,k})_{j \ge 0, k \in \mathbb{Z}}\}$ is an orthonormal basis of $L^2(\mu)$. Let us recall the following
inequalities: for all $p \ge 1$, let $K_p = (\sqrt{2}\|\phi\|_p) \vee \|\psi\|_p$, $K_L = (2\sqrt{2}\,\mathrm{Lip}(\phi)) \vee \mathrm{Lip}(\psi)$, $K_{BV} = AK_L$.
Then for all $j \ge 0$, we have $\|\psi_{j,k}\|_\infty \le K_\infty 2^{j/2}$,




(14) 




00 



(15) 



$\mathrm{Lip}(\psi_{j,k}) \le K_L 2^{3j/2}$,    (16)

$\|\psi_{j,k}\|_{BV} \le K_{BV} 2^{j/2}$.    (17)


We assume that our collection $(S_m)_{m \in \mathcal{M}_n}$ satisfies the following assumption:

[W] dyadic wavelet generated spaces: let $J_n = [\log(n/(2(A + 1)))/\log(2)]$ and for all $J_m =
1, \dots, J_n$, let

$m = \{(0, k),\ -A_2 < k < 2 - A_1\} \cup \{(j, k),\ 1 \le j \le J_m,\ -A_2 < k < -A_1 + 2^j\}$

and $S_m$ the linear span of $\{\psi_{j,k}\}_{(j,k) \in m}$. In particular, we have $D_m = (A - 1)(J_m + 1) + 2^{J_m+1}$
and thus $2^{J_m+1} \le D_m \le (A - 1)(J_m + 1) + 2^{J_m+1} \le A 2^{J_m+1}$.
4.2 The $\tau$-mixing case

The following result proves that we keep the same rate of convergence for the PLSE based
on $\tau$-mixing processes.

Theorem 4.1 Consider the collection of models [W]. Assume that $(X_n)_{n \in \mathbb{Z}}$ is strictly
stationary and arithmetically [AR] $\tau$-mixing with mixing rate $\theta > 5$ and that its marginal
distribution admits a density $s$ with respect to the Lebesgue measure $\mu$. Let $\tilde s$ be the PLSE
defined by (9) with

penim) = KAK^Kbv where K - 8 ' 

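Then there exist constants $c_0 > 0$, $\gamma_1 > 0$ and a sequence $\epsilon_n \to 0$ such that

$\mathbb{E}(\|\tilde s - s\|_2^2) \le (1 + \epsilon_n) \left( \inf_{m \in \mathcal{M}_n,\ D_m \ge c_0 (\log n)^{\gamma_1}} \|s - s_m\|_2^2 + \mathrm{pen}(m) \right)$.    (18)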



Remark: As in Theorem 3.1, the penalty term involves an unknown constant and we have
a condition on the dimension of the models in (18). However, the slope heuristic can also
be used in this context to calibrate the constant and a careful look at the proof shows that
we can take the infimum over all models $m \in \mathcal{M}_n$ provided that we increase the constant
$K$ in front of the penalty term. Our result allows us to derive rates of convergence in Besov
spaces for the PLSE that correspond to the rates in the i.i.d. framework (see Proposition
5.2).

Remark: Theorem 4.1 gives an oracle inequality for the PLSE built on $\tau$-mixing
sequences. This inequality is not pathwise and the constants involved in the penalty term
are not optimal. This is due to technical reasons, mainly because we use the coupling result
(4) instead of (3). However, we recover the same kind of oracle inequality as in the i.i.d.
framework (Birgé and Massart [8]) under weak assumptions on the mixing coefficients since
we only require arithmetical [AR] $\tau$-mixing assumptions on the process $(X_n)_{n \in \mathbb{Z}}$. This is
the first result for these processes to our knowledge.



Let us mention here Theorem 4.1 in Comte & Merlevède [13]. They consider $\alpha$-mixing
processes (for a definition of the coefficient $\alpha$ and its properties, we refer to Rio). They
make geometrical [GEO] $\alpha$-mixing assumptions on the processes and consider penalties of
order $L \log(n) D_m/n$ to get an oracle inequality. This leads to a logarithmic loss in the rates
of convergence. They get the optimal rate under an extra assumption (namely Assumption
[Lip] in Section 3.2). There exist random processes that are $\tau$-mixing and not $\alpha$-mixing
(see Dedecker & Prieur [15]), however, the comparison of these coefficients is difficult in
general and our method cannot be applied in this context.
The constants $c_0$, $\gamma_1$, $n_0$ are given at the end of the proof.
Remark: Inequality (6) can be improved under stronger assumptions on $s$. For example,
when $s$ is bounded, we have $\tilde\beta_k \le C\sqrt{\tau_k}$. Under this assumption and $\theta > 3$, we can
prove that the estimator $\tilde s$ satisfies the inequality

$\mathbb{E}(\|\tilde s - s\|_2^2) \le (1 + \epsilon_n) \left( \inf_{m \in \mathcal{M}_n,\ D_m \ge c_0 (\log n)^{\gamma_1}} \|s - s_m\|_2^2 + \mathrm{pen}(m) \right) + \frac{(\log n)^{\kappa(\theta+1)}}{n^{(\theta-3)/2}}$.

When $\theta < 5$, the extra term $(\log n)^{\kappa(\theta+1)}/n^{(\theta-3)/2}$ may be larger than the main term
$\inf_{m \in \mathcal{M}_n,\ D_m \ge c_0 (\log n)^{\gamma_1}} \|s - s_m\|_2^2 + \mathrm{pen}(m)$. In this case, we don't know if our control
remains optimal. On the other hand, Proposition 5.2 ensures that $\tilde s$ is adaptive over the
class of Besov balls when $\theta > 5$.



5 Minimax results 

5.1 Approximation results on Besov spaces 
Besov balls. 

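Throughout this section, $\Lambda = \{(j, k),\ j \in \mathbb{N},\ k \in \mathbb{Z}\}$ and $\{\psi_{j,k},\ (j, k) \in \Lambda\}$ denotes an
$r$-regular wavelet basis as introduced in Section 4.1. Let $\alpha$, $p$ be two positive numbers such
that $\alpha + 1/2 - 1/p > 0$. For all functions $t \in L^2(\mu)$, $t = \sum_{(j,k) \in \Lambda} t_{j,k} \psi_{j,k}$, we say that $t$
belongs to the Besov ball $B_{\alpha,p,\infty}(M_1)$ on the real line if $\|t\|_{\alpha,p,\infty} \le M_1$ where

$\|t\|_{\alpha,p,\infty} = \sup_{j \in \mathbb{N}} 2^{j(\alpha + 1/2 - 1/p)} \left( \sum_{k \in \mathbb{Z}} |t_{j,k}|^p \right)^{1/p}$.

It is easy to check that if $p \ge 2$, $B_{\alpha,p,\infty}(M_1) \subset B_{\alpha,2,\infty}(M_1)$ so that upper bounds on
$B_{\alpha,2,\infty}(M_1)$ yield upper bounds on $B_{\alpha,p,\infty}(M_1)$.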
Approximation results on Besov spaces. 

We have the following result (Birgé & Massart [8] Section 4.7.1). Suppose that the support
of $s$ equals $[0, 1]$ and that $s$ belongs to the Besov ball $B_{\alpha,2,\infty}(1)$, then whenever $r > \alpha - 1$,

$\|s - s_m\|_2^2 \le \frac{\|s\|_{\alpha,2,\infty}^2}{4^\alpha - 1}\, 2^{-2J_m\alpha}$.    (19)



5.2 Minimax rates of convergence for the PLSE 

We can derive from Theorems 3.1 and 4.1 adaptation results to unknown smoothness over
Besov balls.

Proposition 5.1 Assume that the process $(X_n)_{n \in \mathbb{Z}}$ is strictly stationary and arithmetically
[AR] $\beta$-mixing with mixing rate $\theta > 2$ and that its marginal distribution admits a density
$s$ with respect to the Lebesgue measure $\mu$, that $s$ is supported in $[0, 1]$ and that $s \in L^2(\mu)$.
For all $\alpha, M_1 > 0$, the PLSE $\tilde s$ defined in Theorem 3.1 for the collection of models [W]
satisfies

$\forall \kappa > 2$, $\sup_{s \in B_{\alpha,2,\infty}(M_1)} \mathbb{P}\left( \|\tilde s - s\|_2^2 > L_{M_1,\alpha,\theta}\, n^{-2\alpha/(2\alpha+1)} \right) \le L_{M_1} \frac{(\log n)^{(\theta+2)\kappa}}{n^{\theta/2}}$.

Proposition 5.2 Assume that the process $(X_n)_{n \in \mathbb{Z}}$ is strictly stationary and arithmetically
[AR] $\tau$-mixing with mixing rate $\theta > 5$ and that its marginal distribution admits a density
$s$ with respect to the Lebesgue measure $\mu$, that $s$ is supported in $[0, 1]$ and that $s \in L^2(\mu)$.
For all $\alpha, M_1 > 0$, the PLSE $\tilde s$ defined in Theorem 4.1 satisfies

$\sup_{s \in B_{\alpha,2,\infty}(M_1)} \mathbb{E}(\|\tilde s - s\|_2^2) \le L_{M_1,\alpha,\theta}\, n^{-2\alpha/(2\alpha+1)}$.

Remark: Proposition 5.2 can be compared to Theorem 3.1 in Gannaz & Wintenberger
[18]. They prove near minimax results for the thresholded wavelet estimator introduced
by Donoho et al. [3] in a $\tilde\phi$-dependent setting (for a definition of the coefficient $\tilde\phi$, we
refer to Dedecker & Prieur [15]). Basically, with our notations, their result can be stated
as follows: if $(X_n)_{n \in \mathbb{Z}}$ is $\tilde\phi$-mixing with $\tilde\phi_1(r) \le Ce^{-ar^b}$ for some constants $C$, $a$, $b$, then the
thresholded wavelet estimator $\hat s$ of $s$ satisfies

$\forall \alpha > 0$, $\forall p \ge 1$, $\sup_{s \in B_{\alpha,p,\infty}(M_1) \cap L^\infty(M)} \mathbb{E}(\|\hat s - s\|_2^2) \le L_{M,M_1,\alpha,p} \left( \frac{\log n}{n} \right)^{2\alpha/(2\alpha+1)}$.

The main advantage of their result is that they can deal with Besov balls with regularity
$1 \le p \le 2$. However, in the regular case, when $p \ge 2$, we have been able to remove the extra
$\log n$ factor. Moreover, our result only requires arithmetical [AR] rates of convergence for
the mixing coefficients and we do not have to suppose that $s$ is bounded.



6 Proofs. 

6.1 Proofs of the minimax results. 

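Proof of Proposition 5.1.

Let $\alpha > 0$ and $M_1 > 0$ and assume that $s \in B_{\alpha,2,\infty}(M_1)$. Let $\mathcal{M}_n' = \{m \in \mathcal{M}_n,\ D_m \ge
c_0 (\log n)^{\gamma_1}\}$. By Theorem 3.1 there exists a constant $L_\theta > 0$ such that

$\mathbb{P}\left( \|\tilde s - s\|_2^2 > L_\theta \inf_{m \in \mathcal{M}_n'} \left( \|s - s_m\|_2^2 + \frac{D_m}{n} \right) \right) \le L_s \frac{(\log n)^{(\theta+2)\kappa}}{n^{\theta/2}}$.    (20)

It appears from the proof of Theorem 3.1 that the constant $L_s$ depends only on $\|s\|_2$ and
that it is a nondecreasing function of $\|s\|_2$ so that $L_s$ can be uniformly bounded over
$B_{\alpha,2,\infty}(M_1)$ by a constant $L_{M_1}$ so that, by (20)

$\mathbb{P}\left( \|\tilde s - s\|_2^2 > L_\theta \inf_{m \in \mathcal{M}_n'} \left( \|s - s_m\|_2^2 + \frac{D_m}{n} \right) \right) \le L_{M_1} \frac{(\log n)^{(\theta+2)\kappa}}{n^{\theta/2}}$.

In particular, for a model $m$ in $\mathcal{M}_n'$ with dimension $D_m$ such that

$c_0 (\log n)^{\gamma_1} \le L_1 n^{1/(2\alpha+1)} \le D_m \le L_2 n^{1/(2\alpha+1)}$,

we have

$\mathbb{P}\left( \|\tilde s - s\|_2^2 > L_\theta \left( \|s - s_m\|_2^2 + \frac{D_m}{n} \right) \right) \le L_{M_1} \frac{(\log n)^{(\theta+2)\kappa}}{n^{\theta/2}}$.

Since $s$ belongs to $B_{\alpha,2,\infty}(M_1)$, we can use Inequality (19) to get

$\|s - s_m\|_2^2 \le L_{\alpha,M_1} D_m^{-2\alpha}$.

Thus we obtain

$\mathbb{P}\left( \|\tilde s - s\|_2^2 > L_{M_1,\alpha,\theta}\, n^{-2\alpha/(2\alpha+1)} \right) \le L_{M_1} \frac{(\log n)^{(\theta+2)\kappa}}{n^{\theta/2}}$. □

Proof of Proposition 5.2.

Let $\alpha > 0$ and $M_1 > 0$ and assume that $s \in B_{\alpha,2,\infty}(M_1)$. By Theorem 4.1 we have

$\mathbb{E}(\|\tilde s - s\|_2^2) \le L_\theta \inf_{m \in \mathcal{M}_n,\ D_m \ge c_0 (\log n)^{\gamma_1}} \left\{ \|s - s_m\|_2^2 + \frac{D_m}{n} \right\}$.

Inequality (19) leads to $\|s - s_m\|_2^2 \le L_{\alpha,M_1} D_m^{-2\alpha}$, so that for a model $m$ in $\mathcal{M}_n$ with
dimension $D_m$ such that

$c_0 (\log n)^{\gamma_1} \le L_1 n^{1/(2\alpha+1)} \le D_m \le L_2 n^{1/(2\alpha+1)}$,

we find

$\mathbb{E}(\|\tilde s - s\|_2^2) \le L_{M_1,\alpha,\theta}\, n^{-2\alpha/(2\alpha+1)}$. □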
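
Both proofs rest on the same balance between an approximation term of order $D_m^{-2\alpha}$ and a variance term of order $D_m/n$. The Python sketch below illustrates this balance numerically with all constants set to 1, which is a simplifying assumption; the function names are hypothetical.

```python
import math

def balanced_dimension(n, alpha):
    """Smallest integer D with D >= n^(1 / (2 alpha + 1))."""
    return math.ceil(n ** (1.0 / (2.0 * alpha + 1.0)))

def bias_plus_variance(D, n, alpha):
    """Approximation term D^(-2 alpha) plus variance term D / n (constants set to 1)."""
    return D ** (-2.0 * alpha) + D / n

def minimax_rate(n, alpha):
    """Rate n^(-2 alpha / (2 alpha + 1)) appearing in Propositions 5.1 and 5.2."""
    return n ** (-2.0 * alpha / (2.0 * alpha + 1.0))
```

With $D = $ `balanced_dimension(n, alpha)`, the approximation term is at most $n^{-2\alpha/(2\alpha+1)}$ and the variance term is at most $n^{-2\alpha/(2\alpha+1)} + 1/n$, so the sum is within a constant factor of the rate. The estimator does not need to know $\alpha$ to achieve this, since the penalized criterion selects the dimension from the data.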

6.2 Proof of Theorem 3.1

For all $m_o$ in $\mathcal{M}_n$, we have, by definition of $\hat m$

$\gamma_n(\tilde s) + \mathrm{pen}(\hat m) \le \gamma_n(\hat s_{m_o}) + \mathrm{pen}(m_o)$
$P\gamma(\tilde s) + \nu_n\gamma(\tilde s) + \mathrm{pen}(\hat m) \le P\gamma(\hat s_{m_o}) + \nu_n\gamma(\hat s_{m_o}) + \mathrm{pen}(m_o)$
$P\gamma(\tilde s) - P\gamma(s) - 2\nu_n \tilde s + \mathrm{pen}(\hat m) \le P\gamma(\hat s_{m_o}) - P\gamma(s) - 2\nu_n \hat s_{m_o} + \mathrm{pen}(m_o)$

Since for all $t \in L^2(\mu)$, $P\gamma(t) - P\gamma(s) = \|t - s\|_2^2$, we have

$\|s - \tilde s\|_2^2 \le \|s - \hat s_{m_o}\|_2^2 + \mathrm{pen}(m_o) - V(m_o) - (\mathrm{pen}(\hat m) - V(\hat m)) - 2\nu_n(s_{m_o} - s_{\hat m})$    (21)
where, for all $m \in \mathcal{M}_n$

$V(m) = 2\nu_n(\hat s_m - s_m) = 2 \sum_{(j,k) \in m} \nu_n^2(\psi_{j,k})$.

This decomposition is different from the one used in Birgé & Massart [8] and in Comte &
Merlevède [13]. It allows to improve the constant in the oracle inequality in the $\beta$-mixing
case. Moreover, we choose to prove an oracle inequality of the form (12) for $\beta$-mixing
sequences, which allows to assume only $\theta > 2$ instead of $\theta > 3$. Let us now give a sketch
of the proof:

1. We build an event $\Omega_C$ with $\mathbb{P}(\Omega_C^c) \le 2p\beta_q$ such that, on $\Omega_C$, $\nu_n = \nu_n^*$, where $\nu_n^*$
is built with independent data. A suitable choice of the integers $p$ and $q$ leads to
$p\beta_q \le C (\ln n)^r n^{-\theta/2}$.

2. We use the concentration inequality (7.4) of Birgé & Massart [8] for $\chi^2$-type statistics,
derived from Talagrand's inequality. This allows us to find $p_1(m)$ such that on
an event $\Omega_1$ with $\mathbb{P}(\Omega_1^c \cap \Omega_C) \le L_{1,s} c_n$

$\sup_{m \in \mathcal{M}_n} \{V(m) - p_1(m)\} \le 0$.

$c_n \le C (\ln n)^r n^{-\theta/2}$ and $L_{1,s}$ is some constant depending on $s$.

3. From Bernstein's inequality, we prove that, for all $m, m' \in \mathcal{M}_n$, there exists $p_2(m, m')$
such that, for all $\eta > 0$, on an event $\Omega_2$ with $\mathbb{P}(\Omega_2^c \cap \Omega_C) \le L_{2,s} c_n$,

$\sup_{m, m' \in \mathcal{M}_n} \left\{ \nu_n(s_m - s_{m'}) - \frac{\|s_m - s_{m'}\|_2^2}{2\eta} - \frac{\eta}{2} p_2(m, m') \right\} \le 0$.

Moreover, for all $m, m' \in \mathcal{M}_n$, $p_2(m, m') \le p_2(m, m) + p_2(m', m')$.

4. We have $\|s_{\hat m} - s_{m_o}\|_2^2 \le \|s_{\hat m} - s\|_2^2 + \|s - s_{m_o}\|_2^2$ because $s_{\hat m} - s_{m_o}$ is either the
projection of $s_{\hat m} - s$ onto $S_{m_o}$ or the projection of $s - s_{m_o}$ onto $S_{\hat m}$. Take $\mathrm{pen}(m) \ge
p_1(m) + \eta p_2(m, m)$, we have, on $\Omega_1 \cap \Omega_2 \cap \Omega_C$



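$\|s - \tilde s\|_2^2 \le \|s - \hat s_{m_o}\|_2^2 - \frac{V(m_o)}{2} + \mathrm{pen}(m_o) - \frac{V(m_o)}{2}$    (22)

$\quad - (\mathrm{pen}(\hat m) - p_1(\hat m)) - (p_1(\hat m) - V(\hat m)) - 2\nu_n(s_{m_o} - s_{\hat m})$

$\le \|s - \hat s_{m_o}\|_2^2 - \frac{V(m_o)}{2} + \mathrm{pen}(m_o) - \eta p_2(\hat m, \hat m)$

$\quad + \eta p_2(\hat m, m_o) + \frac{\|s_{\hat m} - s_{m_o}\|_2^2}{\eta}$,    (23)

$\left( 1 - \frac{1}{\eta} \right) \|s - \tilde s\|_2^2 \le \left( 1 + \frac{1}{\eta} \right) \|s - s_{m_o}\|_2^2 + \mathrm{pen}(m_o) + \eta p_2(m_o, m_o)$.    (24)

In (23), we used that $V(m_o) = 2\|\hat s_{m_o} - s_{m_o}\|_2^2 \ge 0$. Inequality (24) follows from the bound
on $p_2$ in step 3 and from Pythagoras Theorem, which gives

$\|s - \hat s_{m_o}\|_2^2 - \frac{V(m_o)}{2} = \|s - s_{m_o}\|_2^2$ and $\|s - s_{\hat m}\|_2^2 \le \|s - \tilde s\|_2^2$.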

Finally, we prove that we can choose $\eta = (\log n)^\gamma$, with $\gamma > 0$ such that $\eta p_2(m_o, m_o) =
o(\mathrm{pen}(m_o))$ and we conclude the proof of (13) from the previous inequalities.

We decompose the proof in several claims corresponding to the previous steps. 

Claim 1 : For all $l = 0, \ldots, p-1$, let us define $A_l = (X_{2lq+1}, \ldots, X_{(2l+1)q})$ and $B_l =
(X_{(2l+1)q+1}, \ldots, X_{(2l+2)q})$. There exist random vectors $A_l^* = (X^*_{2lq+1}, \ldots, X^*_{(2l+1)q})$ and $B_l^* =
(X^*_{(2l+1)q+1}, \ldots, X^*_{(2l+2)q})$ such that, for all $l = 0, \ldots, p-1$:

1. $A_l^*$ and $A_l$ have the same law,

2. $A_l^*$ is independent of $A_0, \ldots, A_{l-1}, A_0^*, \ldots, A_{l-1}^*$,

3. $P(A_l \ne A_l^*) \le \beta_q$,

the same being true for the variables $B_l$.
Proof of Claim 1 :

The proof is derived from Berbee's lemma; we refer to Proposition 5.1 in Viennet [25]
for further details about this construction. □
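
To make the blocking scheme of Claim 1 concrete, the following Python sketch cuts a sample into the alternating blocks $A_l$ and $B_l$. It only illustrates the index bookkeeping: the coupled copies $A_l^*$, $B_l^*$ provided by Berbee's lemma are not constructed, and the helper name `alternating_blocks` is hypothetical.

```python
def alternating_blocks(x, q):
    """Split x = (X_1, ..., X_n) into alternating blocks of length q.

    With the 1-based indices of the text,
    a_blocks[l] = (X_{2lq+1}, ..., X_{(2l+1)q}) and
    b_blocks[l] = (X_{(2l+1)q+1}, ..., X_{(2l+2)q}).
    Observations left over after the last complete pair of blocks are dropped.
    """
    if q < 1:
        raise ValueError("q must be a positive integer")
    p = len(x) // (2 * q)  # number of complete (A_l, B_l) pairs
    a_blocks = [list(x[2 * l * q:(2 * l + 1) * q]) for l in range(p)]
    b_blocks = [list(x[(2 * l + 1) * q:(2 * l + 2) * q]) for l in range(p)]
    return a_blocks, b_blocks


if __name__ == "__main__":
    a, b = alternating_blocks(list(range(1, 13)), q=3)
    print(a)  # the blocks A_0 and A_1
    print(b)  # the blocks B_0 and B_1
```

Two consecutive blocks of the same family are separated by a gap of length $q$, which is what lets the mixing coefficient at lag $q$ control the cost of replacing each block by an independent copy.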

Hereafter, we assume that, for some $\kappa > 2$, $\sqrt{n}(\log n)^\kappa / 2 \le p \le \sqrt{n}(\log n)^\kappa$ and, for the
sake of simplicity, that $pq = n/2$, the modifications needed to handle the extra term when
$q = [n/(2p)]$ being straightforward. Let $\Omega_C = \{\forall l = 0, \ldots, p-1,\ A_l = A_l^*,\ B_l = B_l^*\}$. We
have, for some constant $C$ that does not depend on $n$,

$$P(\Omega_C^c) \le 2p\beta_q \le C\, \frac{(\log n)^{(\theta+2)\kappa}}{n^{\theta/2}}.$$
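
The arithmetic behind this bound can be checked numerically. The sketch below assumes, for illustration only, the polynomial decay $\beta_q = (1+q)^{-(1+\theta)}$; the function name `coupling_bound` and this particular form of the coefficients are assumptions of the example, not part of the proof.

```python
import math


def coupling_bound(n, kappa, theta):
    """Return (p, q, 2*p*beta_q, rate) for the block sizes used in the proof.

    Assumption of this sketch only: beta_q = (1 + q) ** -(1 + theta).
    rate is (log n)^((theta + 2) * kappa) / n^(theta / 2).
    """
    upper = math.sqrt(n) * math.log(n) ** kappa
    p = math.floor(upper)  # largest p with p <= sqrt(n) (log n)^kappa
    if p < 1 or 2 * p > n:
        raise ValueError("n is too small for this choice of kappa")
    q = n // (2 * p)  # q = [n / (2p)]
    beta_q = (1 + q) ** -(1 + theta)
    rate = math.log(n) ** ((theta + 2) * kappa) / n ** (theta / 2)
    return p, q, 2 * p * beta_q, rate


if __name__ == "__main__":
    p, q, bound, rate = coupling_bound(10**6, kappa=2.1, theta=3.0)
    print(p, q, bound, rate)
```

Under this assumption one gets $2p\beta_q \le 2^{2+\theta}(\log n)^{(\theta+2)\kappa} n^{-\theta/2}$. The logarithmic factor is large, so the bound only becomes informative for very large $n$; for small $n$ the admissible range for $p$ is even empty, which the sketch reports as an error.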



Let us first deal with the quadratic term V(m). 

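Claim 2 : Under the assumptions of Theorem 3.1, let $\varepsilon > 0$ and $1 < \gamma < \kappa/2$. We define
$L_1^2 = 2\Phi^2\kappa_1$, $L_2^2 = 8\Phi^{3/2}\kappa_2^{1/2}$, $L_3 = 2\Phi\kappa(\varepsilon)$ and

$$L_{1,m} = 4\left( (1+\varepsilon)L_1 + L_2\sqrt{\frac{(\log D_m)^\gamma + (\log n)^\gamma}{2D_m^{1/4}}} + L_3\,\frac{(\log D_m)^\gamma + (\log n)^\gamma}{2(\log n)^\kappa} \right)^2. \tag{25}$$

Then, we have

$$P\left( \Big\{ \sup_{m \in \mathcal{M}_n} \Big\{ V(m) - \frac{L_{1,m} D_m}{n} \Big\} > 0 \Big\} \cap \Omega_C \right) \le L_{s,\gamma} \exp\left( -\frac{(\log n)^\gamma}{\sqrt{\|s\|_2}} \right),$$

where $L_{s,\gamma} = 2\sum_{D=1}^{\infty} \exp(-(\log D)^\gamma / \sqrt{\|s\|_2})$. In particular, for all $r > 0$, there exists a
constant $L'_{s,r}$ depending on $\|s\|_2$ such that

$$P\left( \Big\{ \sup_{m \in \mathcal{M}_n} \Big\{ V(m) - \frac{L_{1,m} D_m}{n} \Big\} > 0 \Big\} \cap \Omega_C \right) \le \frac{L'_{s,r}}{n^r}.$$

Remark : When $(L_2/L_1)^8 (\log n)^{4(2\kappa-\gamma)} \le D_m \le n$, we have

$$L_{1,m} \le \left( 1 + \varepsilon + \Big(1 + \frac{L_3}{L_1}\Big)(\log n)^{-(\kappa-\gamma)} \right)^2 4L_1^2.$$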



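Proof of Claim 2 :

Let $P_n^*(t) = \sum_{i=1}^n t(X_i^*)/n$ and $\nu_n^*(t) = (P_n^* - P)t$. We have

$$V(m)\mathbf{1}_{\Omega_C} = \sum_{(j,k) \in m} (\nu_n^*)^2(\psi_{j,k})\, \mathbf{1}_{\Omega_C}.$$

Let $B_1(S_m) = \{t \in S_m;\ \|t\|_2 \le 1\}$. For all $t \in B_1(S_m)$, let $\bar t(x_1, \ldots, x_q) = \sum_{i=1}^q t(x_i)/(2q)$ and, for
all functions $g : \mathbb{R}^q \to \mathbb{R}$, let

$$P_{A,p}\, g = \frac{1}{p} \sum_{j=0}^{p-1} g(A_j^*), \quad P_{B,p}\, g = \frac{1}{p} \sum_{j=0}^{p-1} g(B_j^*), \quad \bar P g = \int g(\mu)\, P_A(d\mu)$$

and $\nu_{A,p}\, g = (P_{A,p} - \bar P) g$, $\nu_{B,p}\, g = (P_{B,p} - \bar P) g$.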
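
The normalisation $1/(2q)$ in $\bar t$ is chosen so that, when $n = 2pq$, the empirical mean of $t$ over the whole sample is exactly the sum of the two block-wise means of $\bar t$. The following Python sketch checks this identity on an arbitrary sample; the function names are hypothetical and the sample plays the role of $(X_1^*, \ldots, X_n^*)$.

```python
def t_bar(t, block):
    """Block functional with the 1/(2q) normalisation: sum_i t(x_i) / (2q)."""
    return sum(t(v) for v in block) / (2 * len(block))


def block_decomposition(t, x, q):
    """Return (P_n t, P_{A,p} t_bar + P_{B,p} t_bar) for a sample of size n = 2pq."""
    n = len(x)
    p, remainder = divmod(n, 2 * q)
    if p == 0 or remainder != 0:
        raise ValueError("this sketch assumes n = 2 * p * q exactly")
    a_blocks = [x[2 * l * q:(2 * l + 1) * q] for l in range(p)]
    b_blocks = [x[(2 * l + 1) * q:(2 * l + 2) * q] for l in range(p)]
    p_n = sum(t(v) for v in x) / n                    # empirical mean of t
    p_a = sum(t_bar(t, blk) for blk in a_blocks) / p  # P_{A,p} applied to t_bar
    p_b = sum(t_bar(t, blk) for blk in b_blocks) / p  # P_{B,p} applied to t_bar
    return p_n, p_a + p_b


if __name__ == "__main__":
    sample = [0.1 * i for i in range(12)]
    print(block_decomposition(lambda v: v * v, sample, q=3))
```

Since the same identity holds for the expectations by stationarity, it yields $\nu_n^*(t) = \nu_{A,p}\bar t + \nu_{B,p}\bar t$, which is the starting point of the bounds that follow.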



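Now we have

$$\sum_{(j,k)\in m} (\nu_n^*)^2(\psi_{j,k}) \le 2\sum_{(j,k)\in m} (\nu_{A,p}\bar\psi_{j,k})^2 + 2\sum_{(j,k)\in m}(\nu_{B,p}\bar\psi_{j,k})^2.$$

In order to handle these terms, we use Proposition 7.4, which is stated in Section 7. Taking

$$B_m^2 = \sum_{(j,k)\in m} \mathrm{Var}(\bar\psi_{j,k}(A_1)), \quad V_m^2 = \sup_{t\in B_1(S_m)} \mathrm{Var}(\bar t(A_1)), \quad \text{and} \quad H_m^2 = \Big\|\sum_{(j,k)\in m} (\bar\psi_{j,k})^2\Big\|_\infty,$$

we have, for all $x > 0$,

$$P\left( \Big(\sum_{(j,k)\in m} (\nu_{A,p}\bar\psi_{j,k})^2\Big)^{1/2} > \frac{1+\varepsilon}{\sqrt p}\, B_m + V_m\sqrt{\frac{2x}{p}} + \kappa(\varepsilon)\frac{H_m x}{p} \right) \le e^{-x}. \tag{26}$$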



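In order to evaluate $B_m$, $V_m$ and $H_m$, we use Viennet's inequality (54). There exists a
function $b$ such that, for all $p = 1, 2$, $P|b|^p \le \kappa_p$, where $\kappa_p$ is defined in (2), and, for all
functions $t \in L^2(P)$,

$$\mathrm{Var}(\bar t(A_1)) \le \frac{1}{q}\, P b t^2.$$

Thus

$$B_m^2 = \sum_{(j,k)\in m} \mathrm{Var}(\bar\psi_{j,k}(A_1)) \le \frac{1}{q} \sum_{(j,k)\in m} P b \psi_{j,k}^2 \le \frac{\kappa_1}{q} \Big\| \sum_{(j,k)\in m} \psi_{j,k}^2 \Big\|_\infty.$$

From Assumption [M$_2$], $\|\sum_{(j,k)\in m} \psi_{j,k}^2\|_\infty \le \Phi^2 D_m$; thus, since $pq = n/2$,

$$\frac{B_m^2}{p} \le \frac{2\Phi^2 \kappa_1 D_m}{n}. \tag{27}$$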



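From Viennet's and Cauchy-Schwarz inequalities,

$$V_m^2 = \sup_{t\in B_1(S_m)} \mathrm{Var}(\bar t(A_1)) \le \sup_{t \in B_1(S_m)} \frac{P b t^2}{q} \le \sup_{t\in B_1(S_m)} \frac{(Pb^2)^{1/2}(Pt^4)^{1/2}}{q}.$$

Since $t \in B_1(S_m)$, we have, by the Cauchy-Schwarz inequality,

$$(Pt^4)^{1/2} \le (\|t\|_\infty^3 \|t\|_2 \|s\|_2)^{1/2} \le (\|t\|_\infty^3 \|s\|_2)^{1/2}.$$

From Assumption [M$_2$], we have $\|t\|_\infty \le \Phi\sqrt{D_m}$, and from Viennet's inequality $Pb^2 \le
\kappa_2 < \infty$; thus we obtain

$$V_m^2 \le \frac{\Phi^{3/2}(\|s\|_2\kappa_2)^{1/2} D_m^{3/4}}{q}. \tag{28}$$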



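Finally, from Assumption [M$_2$], we have, using the Cauchy-Schwarz inequality,

$$H_m^2 = \Big\| \sum_{(j,k)\in m} (\bar\psi_{j,k})^2 \Big\|_\infty \le \frac14 \Big\| \sum_{(j,k)\in m} \psi_{j,k}^2 \Big\|_\infty \le \frac{\Phi^2 D_m}{4}. \tag{29}$$

Let $y_n > 0$. We define

$$L_m = (1+\varepsilon)L_1 + L_2\sqrt{\frac{(\log D_m)^\gamma + y_n}{2 D_m^{1/4}}} + L_3\, \frac{(\log D_m)^\gamma + y_n}{2(\log n)^\kappa}.$$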



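We apply Inequality (26) with $x = ((\log D_m)^\gamma + y_n)/\|s\|_2^{1/2}$ and the evaluations (27), (28)
and (29). Recalling that $1/p \le 2/(\sqrt n(\log n)^\kappa)$, this leads to

$$P\left( \Big(\sum_{(j,k)\in m} (\nu_{A,p}\bar\psi_{j,k})^2\Big)^{1/2} > L_m\sqrt{\frac{D_m}{n}} \right) \le \exp\left(-\frac{(\log D_m)^\gamma}{\sqrt{\|s\|_2}}\right)\exp\left(-\frac{y_n}{\sqrt{\|s\|_2}}\right).$$

In order to give an upper bound on $H_m x$, we used that the support of $s$ is included in
$[0,1]$, thus

$$1 = \|s\|_1 \le \|s\|_2.$$

The result follows by taking $y_n = (\log n)^\gamma \ge (\log D_m)^\gamma$. □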

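Claim 3. We keep the notations $\kappa/2 > \gamma > 1$ and $L_2$ of Claim 2. For all
$m, m' \in \mathcal{M}_n$, we take

$$L_{m,m'} = 4\left( L_2\sqrt{\frac{(\log n)^\gamma}{(D_m \vee D_{m'})^{1/4}}} + \frac{4\Phi}{3(\log n)^{\kappa-\gamma}} \right)^2. \tag{30}$$

We have, for all $\eta > 0$,

$$P\left( \Big\{ \sup_{m,m'\in\mathcal M_n} \Big\{ \nu_n(s_m - s_{m'}) - \frac{\|s_m - s_{m'}\|_2^2}{2\eta} - \frac{\eta L_{m,m'}(D_m\vee D_{m'})}{2n} \Big\} > 0 \Big\} \cap \Omega_C \right) \le L_{s,\gamma}\, e^{-(\log n)^\gamma/\sqrt{\|s\|_2}},$$

with $L_{s,\gamma} = 2\sum_{m,m'\in\mathcal M_n} e^{-(\log(D_m\vee D_{m'}))^\gamma/\sqrt{\|s\|_2}}$.

Remark : The constant $L_{s,\gamma}$ is finite since, for all $x, y \ge 1$, $(\log(x\vee y))^\gamma \ge ((\log x)^\gamma +
(\log y)^\gamma)/2$.

As in Claim 2, when $(L_2/L_1)^8(\log n)^{4(2\kappa-\gamma)} \le D_m \le n$, we have

$$L_{m,m} \le \Big(1 + \frac{4\Phi}{3L_1}\Big)^2 (\log n)^{-2(\kappa-\gamma)}\, 4L_1^2.$$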
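
The remark on the finiteness of $L_{s,\gamma}$ rests on an elementary inequality that lets the double sum over pairs of dimensions be bounded by the square of a single sum. The Python sketch below checks both facts numerically on a finite range of dimensions; the function names and the constant `c` (standing for $\sqrt{\|s\|_2}$) are hypothetical.

```python
import math


def max_log_gap(x, y, gamma):
    """Return (log(x v y))^gamma - ((log x)^gamma + (log y)^gamma) / 2 for x, y >= 1."""
    if x < 1 or y < 1 or gamma <= 0:
        raise ValueError("requires x, y >= 1 and gamma > 0")
    lhs = math.log(max(x, y)) ** gamma
    rhs = (math.log(x) ** gamma + math.log(y) ** gamma) / 2
    return lhs - rhs


def double_sum_and_product_bound(dims, gamma, c):
    """Compare sum_{D,D'} exp(-(log(D v D'))^gamma / c)
    with (sum_D exp(-(log D)^gamma / (2c)))^2."""
    double_sum = sum(math.exp(-(math.log(max(d1, d2)) ** gamma) / c)
                     for d1 in dims for d2 in dims)
    single = sum(math.exp(-(math.log(d) ** gamma) / (2 * c)) for d in dims)
    return double_sum, single ** 2


if __name__ == "__main__":
    print(max_log_gap(3, 50, 1.5))
    print(double_sum_and_product_bound(range(1, 200), 1.5, 1.0))
```

The inequality holds because the maximum of two non-negative numbers dominates their average. The series $\sum_D \exp(-(\log D)^\gamma/(2c))$ converges since $\gamma > 1$, which gives the finiteness of the double sum.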



Proof of Claim 3.

We keep the notations of the proof of Claim 2 and, for $m, m' \in \mathcal{M}_n$, let $t_{m,m'} =
(s_m - s_{m'})/\|s_m - s_{m'}\|_2$. We use the inequality $2ab \le a^2\eta^{-1} + b^2\eta$, which holds for all
$a, b \in \mathbb{R}$, $\eta > 0$. This leads to

$$\nu_n^*(s_m - s_{m'}) = \|s_m - s_{m'}\|_2\, \nu_n^*(t_{m,m'}) \le \frac{\|s_m - s_{m'}\|_2^2}{2\eta} + \frac{\eta}{2}\big(\nu_n^*(t_{m,m'})\big)^2 \le \frac{\|s_m-s_{m'}\|_2^2}{2\eta} + \eta\left( (\nu_{A,p}\bar t_{m,m'})^2 + (\nu_{B,p}\bar t_{m,m'})^2 \right).$$

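Now, from Bernstein's inequality (see Section 7), we have

$$\forall x > 0, \quad P\left( \nu_{A,p}\bar t_{m,m'} > \sqrt{\frac{2\mathrm{Var}(\bar t_{m,m'}(A_1))\,x}{p}} + \frac{\|\bar t_{m,m'}\|_\infty x}{3p} \right) \le e^{-x}. \tag{31}$$

From Viennet's and Cauchy-Schwarz inequalities, we have

$$\mathrm{Var}(\bar t_{m,m'}(A_1)) \le \frac{P b\, t_{m,m'}^2}{q} \le \frac{(Pb^2)^{1/2}(Pt_{m,m'}^4)^{1/2}}{q}.$$

Moreover,

$$Pb^2 \le \kappa_2, \quad Pt_{m,m'}^4 \le \|t_{m,m'}\|_\infty^3 \|t_{m,m'}\|_2\|s\|_2.$$

Since $t_{m,m'} \in S_m \cup S_{m'}$ and $\|t_{m,m'}\|_2 = 1$, we have, from Assumption [M$_2$], $\|t_{m,m'}\|_\infty \le
\Phi\sqrt{D_m \vee D_{m'}}$. Let $y_n > 0$. We apply Inequality (31) with $x = [(\log(D_m \vee D_{m'}))^\gamma +
y_n]/\|s\|_2^{1/2}$. We define

$$\frac{L'_{m,m'}}{4} = \left( L_2\sqrt{\frac{(\log(D_m\vee D_{m'}))^\gamma + y_n}{2(D_m\vee D_{m'})^{1/4}}} + \frac{4\Phi[(\log(D_m\vee D_{m'}))^\gamma + y_n]}{6(\log n)^\kappa} \right)^2;$$

we have

$$P\left( \nu_{A,p}\bar t_{m,m'} > \sqrt{\frac{L'_{m,m'}(D_m \vee D_{m'})}{4n}} \right) \le \exp\left(-\frac{(\log(D_m\vee D_{m'}))^\gamma}{\sqrt{\|s\|_2}}\right)\exp\left(-\frac{y_n}{\sqrt{\|s\|_2}}\right).$$

The result follows by taking $y_n = (\log n)^\gamma$ and using $2 \le D_m \le n$.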
Conclusion of the proof:

Let $\eta > 0$ and $\mathrm{pen}'(m) \ge (L_{1,m} + \eta L_{m,m})D_m/n$, where $L_{1,m}$ and $L_{m,m}$ are defined respectively
by (25) and (30). From Claims 1, 2 and 3 and (21), we obtain that, for all $m_o$ and
with probability larger than $1 - L_s(\log n)^{(\theta+2)\kappa} n^{-\theta/2}$,

$$\Big(1 - \frac1\eta\Big)\|s - \tilde s\|_2^2 \le \Big(1+\frac1\eta\Big)\|s - s_{m_o}\|_2^2 + \mathrm{pen}'(m_o) + \eta L_{m_o,m_o}\frac{D_{m_o}}{n}. \tag{32}$$

Assume that $D_m \ge (L_2/L_1)^8(\log n)^{4(2\kappa-\gamma)}$; then we have, from the remarks following Claims 2 and 3,

$$L_{1,m} \le \Big(1+\varepsilon+\Big(1+\frac{L_3}{L_1}\Big)(\log n)^{-(\kappa-\gamma)}\Big)^2 4L_1^2 \quad\text{and}\quad L_{m,m} \le \Big(1+\frac{4\Phi}{3L_1}\Big)^2(\log n)^{-2(\kappa-\gamma)}\,4L_1^2.$$

Take $\eta = (\log n)^{\kappa-\gamma}$; we have $(L_{1,m_o} + \eta L_{m_o,m_o})D_{m_o}/n \le C\,\mathrm{pen}(m_o)$. Fix $\varepsilon > 0$ such that
$[1+\varepsilon]^2 < K/4$. Since $\kappa > \gamma$, for $n \ge n_o$, we have $L_{1,m} + \eta L_{m,m} \le KL_1^2$; thus, inequality
(13) follows from (32) as soon as $n \ge n_o$. We remove the condition $n \ge n_o$ by
improving the constant $L_s$ in (13) if necessary. □

6.3 Proof of Theorem 4.1

The proof follows the previous one; the main difference is that the coupling lemma (Claim
1) as well as the covariance inequalities are much harder to handle in the $\tau$-mixing case.
This leads to more technical computations to recover the results obtained in the $\beta$-mixing
case (see Claims 2, 3 and the proof of inequality (45)). We start with the decomposition
(21). As in the previous proof, the decomposition of the risk given in Birgé & Massart [3]
or in Comte & Merlevède [13] could be used. This leads to a loss in the constant in front
of the main term in (15) without avoiding any of the main difficulties. We divide the proof
into four claims.

Claim 1 : For all $l = 0, \ldots, p-1$, let us denote $A_l = (X_{2lq+1}, \ldots, X_{(2l+1)q})$ and $B_l =
(X_{(2l+1)q+1}, \ldots, X_{(2l+2)q})$. There exist random vectors $A_l^* = (X^*_{2lq+1}, \ldots, X^*_{(2l+1)q})$ and $B_l^* =
(X^*_{(2l+1)q+1}, \ldots, X^*_{(2l+2)q})$ such that, for all $l = 0, \ldots, p-1$:

• $A_l^*$ and $A_l$ have the same law,

• $A_l^*$ is independent of $A_0, \ldots, A_{l-1}, A_0^*, \ldots, A_{l-1}^*$,

• $E(|A_l - A_l^*|_q) \le q\tau_q$,

the same being true for the variables $B_l$.
Proof of Claim 1 :

We use the same recursive construction as Viennet [25].
Let $(\delta_j)_{0 \le j \le p-1}$ be a sequence of independent random variables uniformly distributed over
$[0, 1]$ and independent of the sequence $(A_j)_{0 \le j \le p-1}$. Let $A_0^* = (X_1^*, \ldots, X_q^*)$ be the random
variable given by equality (4) for $\mathcal{M} = \sigma(X_i, i \le -q)$, $A_0$ and $\delta_0$.

Now suppose that we have built the variables $A_l^*$ for $l < l'$. From equality (4) applied to
the $\sigma$-algebra $\sigma(A_l, A_l^*, l < l')$, $A_{l'}$ and $\delta_{l'}$, there exists a random variable $A_{l'}^*$ satisfying
the hypotheses of Claim 1.

We build in the same way the variables $B_l^*$ for all $l = 0, \ldots, p-1$. □



We keep the notations $\nu_n^*$, $\nu_{A,p}$, $\nu_{B,p}$, $\bar t$ and $B_1(S_m)$ that we introduced in the proof of Theorem
3.1. As in the proof of Theorem 3.1, we assume that, for some $\kappa > 2$, $\sqrt{n}(\log n)^\kappa/2 \le
p \le \sqrt{n}(\log n)^\kappa$ and, for the sake of simplicity, that $pq = n/2$, the modifications needed to
handle the extra term when $q = [n/(2p)]$ being straightforward. We have

$$V(m) = \sum_{(j,k)\in m} \nu_n^2(\psi_{j,k}) \le 2\sum_{(j,k)\in m} (P_n - P_n^*)^2(\psi_{j,k}) + 2\sum_{(j,k)\in m} (\nu_n^*)^2(\psi_{j,k}). \tag{33}$$

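Claim 2 : There exists a constant $L = L_{A, K_L, K_\infty, \kappa, \theta}$ such that

$$E\left( \sup_{m\in\mathcal{M}_n} \sum_{(j,k)\in m} \big((P_n - P_n^*)\psi_{j,k}\big)^2 \right) \le L\, \frac{(\log n)^{\kappa(\theta+1)}}{n^{(\theta-3)/2}}. \tag{34}$$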



Proof of Claim 2 : 



< E sup ^ (P re -P*) 2 (^- fc 

< e e E((p n -p n *) 2 fe)) 

m£Mn (j,k)£m 

2 P 

- ~2 E ^2(9A,m(j> k i l > l> ) +9B,rn(j,k,l,l')) 
P meM n l,l'=l 



with 



9m,A(j> k, I, l') = E [ fe(A) - ^,fc(^D) " ?M^*)) ] • 



We develop this last term and we get, since 



l^fcOz) - i>3,k{y)\ < 



K L 2^' 2 \x - y\ c 
2^ 



g Ajm (j,k,l,l') < E[ ^ 

, 0',fc)Gm 



< E £ |^- fe (A)-^, fc (A*)|^2 3 ^ 



K {j,k)£m 
K L T ( 



2<? 



< sup { 2 2 3 ^ 2 fo(x)-^ fe (y)| 



(j,k)em 



< ^1J2 2 3J/2 sup { £ - 



j=o Uez 



< -AK L K 0O 2 2Jm r 9 since 



fc€2 



IS 



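We can do the same computations for the term $g_{B,m}(j, k, l, l')$ and we obtain

$$E\left( \sup_{m\in\mathcal{M}_n} \sum_{(j,k)\in m} \big((P_n - P_n^*)\psi_{j,k}\big)^2 \right) \le L\tau_q \sum_{m\in\mathcal{M}_n} 2^{2J_m} \le L\tau_q 2^{2J_n} \le L\, \frac{(\log n)^{\kappa(\theta+1)}}{n^{(\theta-3)/2}}.$$

The last inequality comes from $q \ge \sqrt{n}/(2(\log n)^\kappa)$ and Assumption [AR]; the one before
comes from Assumption [W]. □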

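Claim 3. Let us keep the notations of Theorem 4.1, let $u = 6/(7 + \theta) < 1/2$ and recall
that $\kappa > 2$. Let $\gamma$ be a real number in $(1, \kappa/2)$. Let

$$L_1^2 = 4AK_\infty K_{BV}\sum_{l=0}^\infty\tilde\beta_l, \quad L_2^2 = 2\Phi K_{BV}^u\sum_{k=0}^\infty\tilde\beta_k^u, \quad L_3 = \kappa(\varepsilon)\Phi$$

and

$$L_{1,m} = 4(1+\varepsilon)\left((1+\varepsilon)L_1 + L_2\sqrt{\frac{(\log D_m)^\gamma}{D_m^{1/2-u}}} + L_3\,\frac{(\log D_m)^\gamma}{(\log n)^\kappa}\right)^2. \tag{35}$$

There exists a constant $L_s$ such that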



E sup i K) 2 fe) 



< 



L x 
n



(j,fc)£m 

Remark : The series $\sum_{l=0}^\infty \tilde\beta_l$ and $\sum_{k=0}^\infty \tilde\beta_k^u$ are convergent under our hypotheses on the
coefficients $\tau$. Since $s \in L^2([0, 1])$, we have $\tilde\beta_l \le 2\|s\|_2^{2/3}\tau_l^{1/3}$ and thus
$\tilde\beta_l \le 2\|s\|_2^{2/3}(1 + l)^{-(1+\theta)/3}$. The series $\sum_{k=0}^\infty \tilde\beta_k^u$ converges since $\theta > 5$ and

$$\frac{u(1+\theta)}{3} = \frac{2(1+\theta)}{7+\theta} = 1 + \frac{\theta-5}{\theta+7} > 1.$$

We use here $\tilde\beta$ instead of $\tau$, which allows us to take $L_1$ not depending on $\|s\|_2$.
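
The exponent identity used in this remark can be verified with exact rational arithmetic. The short Python sketch below, with the hypothetical helper `series_exponent`, computes $u(1+\theta)/3$ for $u = 6/(7+\theta)$ and shows that it exceeds 1 exactly when $\theta > 5$.

```python
from fractions import Fraction


def series_exponent(theta):
    """Return u * (1 + theta) / 3 with u = 6 / (7 + theta), as an exact fraction."""
    theta = Fraction(theta)
    u = Fraction(6) / (7 + theta)
    return u * (1 + theta) / 3


if __name__ == "__main__":
    for theta in (4, 5, 6, 10):
        print(theta, series_exponent(theta))
```

The value equals $1 + (\theta-5)/(\theta+7)$, so the boundary case $\theta = 5$ gives exactly 1, which corresponds to a divergent harmonic-type series.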

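Proof of Claim 3 :

As in the previous section, we use the following decomposition

$$\sum_{(j,k)\in m} (\nu_n^*)^2(\psi_{j,k}) = \sum_{(j,k)\in m} \big(\nu_{A,p}(\bar\psi_{j,k}) + \nu_{B,p}(\bar\psi_{j,k})\big)^2 \le 2\sum_{(j,k)\in m} \big(\nu_{A,p}(\bar\psi_{j,k})\big)^2 + 2\sum_{(j,k)\in m} \big(\nu_{B,p}(\bar\psi_{j,k})\big)^2.$$

We treat both terms with Proposition 7.4 applied to the random variables $(A_l^*)_{l=0,\ldots,p-1}$
and $(B_l^*)_{l=0,\ldots,p-1}$ and to the class of functions $\{(\bar\psi_{j,k})_{(j,k)\in m}\}$. Let

$$B_m^2 = \sum_{(j,k)\in m} \mathrm{Var}(\bar\psi_{j,k}(A_1)), \quad V_m^2 = \sup_{t\in B_1(S_m)} \mathrm{Var}(\bar t(A_1)), \quad H_m^2 = \Big\| \sum_{(j,k)\in m} (\bar\psi_{j,k})^2 \Big\|_\infty.$$

We have, from Proposition 7.4,

$$\forall x > 0, \quad P\left( \Big(\sum_{(j,k)\in m} (\nu_{A,p}\bar\psi_{j,k})^2\Big)^{1/2} > \frac{1+\varepsilon}{\sqrt p}\, B_m + V_m\sqrt{\frac{2x}{p}} + \kappa(\varepsilon)\frac{H_m x}{p} \right) \le e^{-x}. \tag{36}$$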
Let us now evaluate $B_m$, $V_m$ and $H_m$. We have



E Var^^W] 

V (j,fe)6m \i=l / 



Prom (QD) and jEJ we have Vj, k Uj,k\\ B v < K BV ^ /2 andVj || £ fcgZ KMIU < AK^I 2 . 
Thus, from Inequality ([5]) 



(j,k)&m \i=l 



(j,k)em 1=1 

Jm 1 

^ 2 i E E E ii^Hbv e (i^(^i)i%(^i), 

j=0 fceZ 2=1 



1/71 



i=o 



E iV'i.fe^o 



k& 
00 



< 2q[AK 0O K BV J2Pl) D r i 



1=0 



The last inequality comes from Assumption [W]. 
Since L\ = AK^Kby A we nave 



T 2 r> 



(37) 



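Let us deal with the term $V_m^2$. We have

$$V_m^2 = \sup_{t\in B_1(S_m)} \mathrm{Var}(\bar t(A_1)) \le \frac{1}{2q^2}\sum_{k=1}^q (q+1-k) \sup_{t\in B_1(S_m)} |\mathrm{Cov}(t(X_1), t(X_k))|. \tag{38}$$

From Inequality (5), we have

$$|\mathrm{Cov}(t(X_1), t(X_k))| \le \|t\|_{BV}\|t\|_\infty\tilde\beta_{k-1}.$$

Since $t$ belongs to $B_1(S_m)$, we have $t = \sum_{(j,k)\in m} a_{j,k}\psi_{j,k}$, with $\sum_{(j,k)\in m} a_{j,k}^2 \le 1$. Thus,
by the Cauchy-Schwarz inequality, for all $x_1 < \cdots < x_{l+1}$,

$$\sum_{i=1}^{l} |t(x_{i+1}) - t(x_i)| \le \sum_{(j,k)\in m} |a_{j,k}| \sum_{i=1}^{l} |\psi_{j,k}(x_{i+1}) - \psi_{j,k}(x_i)| \le \Big(\sum_{(j,k)\in m} a_{j,k}^2\Big)^{1/2}\Big(\sum_{(j,k)\in m} \|\psi_{j,k}\|_{BV}^2\Big)^{1/2}.$$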



Thus $\|t\|_{BV} \le D_m K_{BV}$. From Assumption [M$_2$], we have $\|t\|_\infty \le \Phi\sqrt{D_m}$. Thus

$$|\mathrm{Cov}(t(X_1), t(X_k))| \le \Phi K_{BV}\tilde\beta_{k-1} D_m^{3/2}. \tag{39}$$

Moreover, we have, by the Cauchy-Schwarz inequality and [M$_2$],

$$|\mathrm{Cov}(t(X_1), t(X_k))| \le \|t\|_\infty \|t\|_2 \|s\|_2 \le \Phi\|s\|_2\sqrt{D_m}. \tag{40}$$

We use the inequality $a \wedge b \le a^u b^{1-u}$ with

$$a = \Phi K_{BV}\tilde\beta_{k-1} D_m^{3/2}, \quad b = \Phi\|s\|_2\sqrt{D_m}, \quad u = \frac{6}{7+\theta} < \frac12.$$

From (39) and (40), we derive that

$$|\mathrm{Cov}(t(X_1), t(X_k))| \le L'_k D_m^{1/2+u}, \quad \text{where } L'_k = \Phi (K_{BV}\tilde\beta_{k-1})^u \|s\|_2^{1-u}.$$

Plugging this inequality into (38), we obtain

$$V_m^2 \le \frac{L_2^2\|s\|_2^{1-u} D_m^{1/2+u}}{4q}, \tag{41}$$

since $L_2^2 = 2\Phi K_{BV}^u \sum_{k=0}^\infty \tilde\beta_k^u$.



Finally, we have from hypothesis [M$_2$]

$$H_m^2 \le \frac14 \Big\| \sum_{(j,k)\in m} \psi_{j,k}^2 \Big\|_\infty \le \frac{\Phi^2 D_m}{4}. \tag{42}$$



Let $y > 0$ and let us apply Inequality (36) with $x = ((\log D_m)^\gamma / \|s\|_2^{1-u}) + (y / D_m^{1/2+u})$.
We have, from (37), (41) and (42),



/- ^2/7 \ . | /i | v torn , / (log An)' 

E K P ) fe) > a + ^ + I 



+ 



1/2+u 



\| 2pq 



1-u n l/2+u 



(logAnJ 7 , y 



ll-U 



+ 



n l/2+n 



(logg m )T 

II 111 - ^ 

< e l|s|| 2 e 



y 



Then, we use the inequality \Ja + ]3 < ^fa + \/~j3 with 



a 



(log D m ) 

II „I|1-M 



and 



n l/2+u 



and the inequality (a + b) 2 < (1 + e)a 2 + (1 + e 1 )fe 2 with 
a= ( (l + e)Li + L 2 



(log An) 7 L 3 (logAn) 7 



n l/2-u 



+ 



1-u 



(logn)'' 



A 



and 6= — ( A\/||s|| 2 u y + 



Ay 



(logn) K An 



Setting L m = (1 + e)a 2 n/ D m , we obtain 

Ei- \2ii \ L m D m (1+e 1 ) 
{VA,p) Wj,k) > 
J n n 



K (j,k)£m 
OogOroT 



L 2 \/\\s\\l~ u y + 



L-zy 



(logn) K A 



< e l|s|| 2 e 



1 --1^1 



-(1/2+u) 



V 
Thus, for all $y > 0$,



T n T \ (log Dm) 7 n -(l/2+u) 



sup £ (^(^ >^( y + y 2 ) < £ e 



II 2 



where L s = 2(1 + e" 1 ) 



2 



(L 2 ^|| S ||i- u )VL 3 /((log2m 
equality to prove Claim 3.D 

Claim 4 : We keep the notations of the previous Claims. Let 



We can integrate this last in- 



II >\ A T / (log(flmVOmQ)'' , g ^ 



T/ien £/iere exists a constant L s ^ depending on \\s\\ 2 and 9 such that, for all r] > 

^ / , \ ||s m -s m /||2 L 2 (m,m')(D m VD m -) h 

E sup < v n {s m - S m i) --T] > < 

\m,m'eM n { 27 1 J J 

Proof of Claim 4 ■' 



E SUp \Vn\Sm-Sm') ^ V > 

\m,m'eM n { 2r l n ) / 

<E( sup {P n -P*)( Sm -s m ,) ] 

\ m,m' I 

lW f J */ \ lljm ~ *wl|| L 2 (m,m')(D m V D m >) \\ , 

+E sup < i/ n (s m - s m /) rj } . 44 

\m,m> { 277 n ) J 

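Since, for all $l = 0, \ldots, p-1$, $E(|A_l - A_l^*|_q) \le q\tau_q$, we have

$$E\Big( \sup_{m,m'} (P_n - P_n^*)(s_m - s_{m'}) \Big) \le 2\sum_{m,m'} E\Big( \big|\overline{(s_m - s_{m'})}(A_1) - \overline{(s_m - s_{m'})}(A_1^*)\big| \Big) \le \tau_q \sum_{m,m'} \mathrm{Lip}(s_m - s_{m'}).$$

When $m \subset m'$, we have, for all $x, y \in \mathbb{R}$, using Assumption [W],

$$|(s_m - s_{m'})(x) - (s_m - s_{m'})(y)| \le \sum_{j=J_m+1}^{J_{m'}} \sum_{k\in\mathbb{Z}} |P\psi_{j,k}|\, |\psi_{j,k}(x) - \psi_{j,k}(y)|.$$

Let us fix $j \in [J_m + 1, J_{m'}]$. From Assumption [W], there are fewer than $A$ indexes $k \in \mathbb{Z}$
such that $\psi_{j,k}(x) \ne 0$; thus there are fewer than $2A$ indexes such that $|\psi_{j,k}(x) - \psi_{j,k}(y)| \ne 0$.
Hence

$$\sum_{k\in\mathbb{Z}} |P\psi_{j,k}|\, \frac{|\psi_{j,k}(x) - \psi_{j,k}(y)|}{|x - y|} \le 2A \sup_{k} |P\psi_{j,k}|\, \mathrm{Lip}(\psi_{j,k}) \le 2A\|s\|_2 K_L 2^{3j/2}.$$

Thus, $\mathrm{Lip}(s_m - s_{m'}) \le 2A\|s\|_2 K_L \sqrt{8}\, 2^{3J_{m'}/2}/(\sqrt{8} - 1)$ and, by Assumptions [W], [AR] and
the value of $q$,

$$E\Big( \sup_{m,m'} (P_n - P_n^*)(s_m - s_{m'}) \Big) \le L_s n^{3/2}(\log n)\tau_q \le L_s \frac{(\log n)^{\kappa(\theta+1)+1}}{n^{(\theta-2)/2}}. \tag{45}$$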



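Let us deal with the other term in (44). We have, for all $\eta > 0$,

$$\nu_n^*(s_m - s_{m'}) \le \frac{\|s_m - s_{m'}\|_2^2}{2\eta} + \frac{\eta}{2}\big(\nu_{A,p}(\bar t_{m,m'}) + \nu_{B,p}(\bar t_{m,m'})\big)^2 \le \frac{\|s_m - s_{m'}\|_2^2}{2\eta} + \eta\big( (\nu_{A,p}\bar t_{m,m'})^2 + (\nu_{B,p}\bar t_{m,m'})^2 \big), \tag{46}$$

where, as in the proof of Theorem 3.1, $t_{m,m'} = (s_m - s_{m'})/\|s_m - s_{m'}\|_2$. We apply Bernstein's
inequality to the function $\bar t_{m,m'}$ and the variables $A_l^*$; we have

$$\forall x > 0, \quad P\left( \nu_{A,p}(\bar t_{m,m'}) > \sqrt{\frac{2\mathrm{Var}(\bar t_{m,m'}(A_1))\,x}{p}} + \frac{\|\bar t_{m,m'}\|_\infty x}{3p} \right) \le e^{-x}. \tag{47}$$

We proceed as in the proof of Claim 3 to control this variance. We have, by stationarity
of the process $(X_n)_{n\in\mathbb{Z}}$,

$$\mathrm{Var}(\bar t_{m,m'}(A_1)) \le \frac{1}{2q^2}\sum_{k=0}^{q-1} (q-k)\, |\mathrm{Cov}(t_{m,m'}(X_1), t_{m,m'}(X_{k+1}))|.$$

From Inequality (5), we have

$$|\mathrm{Cov}(t_{m,m'}(X_1), t_{m,m'}(X_{k+1}))| \le \|t_{m,m'}\|_{BV} \|t_{m,m'}\|_\infty \tilde\beta_k.$$

Let $m \Delta m'$ be the set of indexes that belong to $m \cup m'$ but do not belong to $m \cap m'$. We
use the same computations as in the proof of Claim 3 to get

$$\|t_{m,m'}\|_{BV} \le \Big( \sum_{(j,k)\in m \Delta m'} \|\psi_{j,k}\|_{BV}^2 \Big)^{1/2} \le K_{BV}(D_m \vee D_{m'}).$$

Since $\|t_{m,m'}\|_\infty \le \Phi\sqrt{D_m \vee D_{m'}}$, we have

$$|\mathrm{Cov}(t_{m,m'}(X_1), t_{m,m'}(X_{k+1}))| \le \Phi K_{BV}\tilde\beta_k (D_m \vee D_{m'})^{3/2}. \tag{48}$$

Moreover, we have

$$|\mathrm{Cov}(t_{m,m'}(X_1), t_{m,m'}(X_{k+1}))| \le \|t_{m,m'}\|_\infty \|t_{m,m'}\|_2 \|s\|_2 \le \Phi\|s\|_2\sqrt{D_m \vee D_{m'}}. \tag{49}$$

Thus, using $a \wedge b \le a^u b^{1-u}$ with

$$a = \Phi K_{BV}\tilde\beta_k (D_m \vee D_{m'})^{3/2}, \quad b = \Phi\|s\|_2\sqrt{D_m \vee D_{m'}}, \quad \text{and} \quad u = \frac{6}{7+\theta} < \frac12,$$

we have

$$|\mathrm{Cov}(t_{m,m'}(X_1), t_{m,m'}(X_{k+1}))| \le \Phi K_{BV}^u \tilde\beta_k^u \|s\|_2^{1-u} (D_m \vee D_{m'})^{1/2+u}.$$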
Thus



Moreover 



y \\t"m,m' ||oo — „^ V D, 



< 



(50) 



(51) 



Now, we use g?D with x = (log(£> TO V D m ,))~< / \\s\\\~ u + y/(D m V D m 1/2+B . Prom (jSH 
and (j5"T1) . we have for all y > 0, 



)>L, 



\ 



(D m VD ml y/z+ u 
2pq 



(log(D m V D m /))7 + 



$y/D m V Zgj / (log(£) m V An')) 7 



6p 

(log(D m VD ,))" 



ll-U 



+ 



(£> m V An') 1/2+t 



< e 



ST 15 e (DmV^lVH. 



Now we use the inequality y/a + b < \[a + \fb with 
a = (log(jD m V An')) 7 and & = 
and we obtain, using Assumption [Mi] 



i ill-" 

Nl 2 V 



< e 



^ 'A,ptm,m' 
(io g (o m yo m ,)) 

II II 1 — U 

N| 2 



L 2 (m,m')(D m VD' m ) > L ± ,^ + y) 
n 



with 



r , ( , /(log(A»V D m ,)yi $(log(D m V D m ,)P 
L 2 (m, m ) = (i^ {DmVDmj)1/ ^ + 3(i^ 



and L s = L 2 \l \\s\\\~ u V 



3(log2) K 2 u 



Thus, we obain 



(log(D m Vi3 ,))T 



L 2 (m,m')(D m y D'J L\ 



n 



n 



< e 



I: nl — u 
IHI 2 



The same result holds for VB,ptrn,m' • Thus we obtain from (|46 



(log(D m vr> ,)) 7 



< 2e 



,,1— u 
l! 2 



2?7 



"(D m Vi3 ro ,) 1 /2+" 



J| + L 2 (m,m')(AnV^) + £j + y2 A 
n n J 
We deduce that

\s m -s m '\\l L 2 (m,m')(D m V D' m 



P (3m, mf G M n , v*(s m - s m ,) - hm 2 J W " 2 - 4r? 
>8r 1 ^(y + y 2 ))<2 £ (e 
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m,m' €Mn 

We integrate this last inequality to get Claim 4. □

Conclusion of the proof:

Take

$$\mathrm{pen}'(m) \ge (2L_{1,m} + \eta L_2(m, m))\,\frac{D_m}{n},$$

where $L_{1,m}$ and $L_2(m, m)$ are defined by (35) and (43) respectively. From Claims 2, 3 and
4, if we take the expectation in (21), we have, for some constant $L_s$,



D m \ + r } L ± (52) 



n I n 



E (ll s ~ < E f||s - s mo ||2 + pen'(m ) - V(m ) + 2r]L 2 (m ,m ) 
Moreover, if D m > ((L 2 /Li)(log n )«-7/2) 2 ( 7+e )/( e - 5 ) ; we have 

If s (i+e)((i +£) + (i + iL)(io gn r(-)) 2 

< (1-m) 3 + (1 + «T 1 )(1 + e)(i + ^) (logn)- 2 <«-^). (53) 

We use the inequality $(a + b)^2 \le (1 + \varepsilon)a^2 + (1 + \varepsilon^{-1})b^2$ to obtain (53). Moreover, we have

L 2 (m,m) <4L 2 ((l + (logn)"^^ . 

As in the proof of Theorem 3.1, we take $\eta = (\log n)^{\kappa-\gamma}$ and we fix $\varepsilon$ sufficiently small. For
$n \ge n_o$, we have $2L_{1,m} + \eta L_2(m, m) \le KL_1^2$. Thus inequality (18) follows from (52). □

7 Appendix 

This section is devoted to technical lemmas that are needed in the proofs. 
7.1 Covariance inequality 

Lemma 7.1 (Viennet's inequality) Let $(X_n)_{n\in\mathbb{Z}}$ be a stationary and $\beta$-mixing process. There
exists a positive function $b$ such that $P(b) \le \sum_{l=0}^\infty \beta_l$, $P(b^p) \le p\sum_{l=0}^\infty (l+1)^{p-1}\beta_l$, and, for all
functions $h \in L^2(P)$,

$$\mathrm{Var}\Big( \sum_{l=1}^q h(X_l) \Big) \le 4qP(bh^2). \tag{54}$$

7.2 Concentration inequalities

We sum up in this section the concentration inequalities we used in the proofs. We begin
with Bernstein's inequality.

Proposition 7.2 (Bernstein's inequality)

Let $X_1, \ldots, X_n$ be i.i.d. random variables valued in a measurable space $(X, \mathcal{X})$ and let $t$ be a
measurable real-valued function. Let $v = \mathrm{Var}(t(X_1))$ and $b = \|t\|_\infty$; then, for all $x > 0$, we
have

$$P\left( (P_n - P)t > \sqrt{\frac{2vx}{n}} + \frac{bx}{3n} \right) \le e^{-x}.$$
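
As a sanity check of Proposition 7.2, the following Python sketch estimates the left-hand side by Monte Carlo in a simple case chosen for illustration: $X$ uniform on $[0,1]$ and $t(x) = x - 1/2$, so that $Pt = 0$, $v = 1/12$ and $b = 1/2$. The function names and this example are assumptions of the sketch.

```python
import math
import random


def bernstein_threshold(v, b, n, x):
    """Deviation level sqrt(2 v x / n) + b x / (3 n) of Proposition 7.2."""
    return math.sqrt(2 * v * x / n) + b * x / (3 * n)


def exceedance_frequency(n, x, trials, seed=0):
    """Monte Carlo frequency of {(P_n - P) t > threshold} for t(u) = u - 1/2, U uniform."""
    rng = random.Random(seed)
    threshold = bernstein_threshold(1 / 12, 0.5, n, x)
    hits = 0
    for _ in range(trials):
        centred_mean = sum(rng.random() - 0.5 for _ in range(n)) / n
        if centred_mean > threshold:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    freq = exceedance_frequency(n=50, x=1.0, trials=2000, seed=1)
    print(freq, math.exp(-1.0))  # observed frequency versus the bound e^{-x}
```

The observed frequency should stay below $e^{-x}$, and in this example it is typically much smaller, since Bernstein's inequality is not tight for moderate deviations.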

Now we give the most important tool of our proof: a concentration inequality for the
supremum of the empirical process over a class of functions. We give here the version of
Bousquet [10].



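Theorem 7.3 (Talagrand's Theorem)

Let $X_1, \ldots, X_n$ be i.i.d. random variables valued in some measurable space $(X, \mathcal{X})$. Let $\mathcal{F}$ be
a separable class of bounded functions from $X$ to $\mathbb{R}$ and assume that all functions $t$ in $\mathcal{F}$
are $P$-measurable and satisfy $\mathrm{Var}(t(X_1)) \le \sigma^2$, $\|t\|_\infty \le b$. Then, for all $x > 0$,

$$P\left( \sup_{t\in\mathcal{F}} \nu_n(t) > E\Big(\sup_{t\in\mathcal{F}} \nu_n(t)\Big) + \sqrt{\frac{2x\big(\sigma^2 + 2bE(\sup_{t\in\mathcal{F}} \nu_n(t))\big)}{n}} + \frac{bx}{3n} \right) \le e^{-x}.$$

In particular, for all $\varepsilon > 0$, if $\kappa(\varepsilon) = 1/3 + \varepsilon^{-1}$, we have

$$P\left( \sup_{t\in\mathcal{F}} \nu_n(t) > (1+\varepsilon)E\Big(\sup_{t\in\mathcal{F}} \nu_n(t)\Big) + \sigma\sqrt{\frac{2x}{n}} + \kappa(\varepsilon)\frac{bx}{n} \right) \le e^{-x}.$$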
tar \t&r I V n n 



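We can deduce from this Theorem a concentration inequality for $\chi^2$-type statistics.
This is Proposition (7.3) of Massart.

Proposition 7.4 Let $X_1, \ldots, X_n$ be independent and identically distributed random variables
valued in some measurable space $(X, \mathcal{X})$. Let $P$ denote their common distribution.
Let $(\phi_\lambda)_{\lambda\in\Lambda}$ be a finite family of measurable and bounded functions on $(X, \mathcal{X})$. Let

$$H_\Lambda^2 = \Big\| \sum_{\lambda\in\Lambda} \phi_\lambda^2 \Big\|_\infty \quad \text{and} \quad B_\Lambda^2 = \sum_{\lambda\in\Lambda} \mathrm{Var}(\phi_\lambda(X_1)).$$

Moreover, let $S_\Lambda = \{a \in \mathbb{R}^\Lambda : \sum_{\lambda\in\Lambda} a_\lambda^2 = 1\}$ and

$$V_\Lambda^2 = \sup_{a\in S_\Lambda} \mathrm{Var}\Big( \sum_{\lambda\in\Lambda} a_\lambda \phi_\lambda(X_1) \Big).$$

Then the following inequality holds, for all positive $x$ and $\varepsilon$:

$$P\left( \Big( \sum_{\lambda\in\Lambda} \big((P_n - P)\phi_\lambda\big)^2 \Big)^{1/2} \ge \frac{1+\varepsilon}{\sqrt n}\, B_\Lambda + V_\Lambda\sqrt{\frac{2x}{n}} + \kappa(\varepsilon)\frac{H_\Lambda x}{n} \right) \le e^{-x}, \tag{55}$$

where $\kappa(\varepsilon) = \varepsilon^{-1} + 1/3$.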
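
To illustrate the quantities in Proposition 7.4, the Python sketch below takes, as an assumption of the example, the regular histogram basis $\phi_\lambda = \sqrt{d}\,\mathbf{1}_{[\lambda/d, (\lambda+1)/d)}$ on $[0,1)$ and $P$ uniform. In that case $B_\Lambda^2 = d - 1$, $H_\Lambda^2 = d$ and $V_\Lambda^2 \le 1$. The function names are hypothetical.

```python
import math


def chi_statistic(sample, d):
    """sqrt(sum_lambda ((P_n - P) phi_lambda)^2) for the histogram basis with d bins.

    Assumes P is the uniform law on [0, 1) and every observation lies in [0, 1).
    """
    n = len(sample)
    counts = [0] * d
    for v in sample:
        if not 0.0 <= v < 1.0:
            raise ValueError("observations must lie in [0, 1)")
        counts[min(int(v * d), d - 1)] += 1
    total = 0.0
    for c in counts:
        empirical = math.sqrt(d) * c / n  # P_n phi_lambda
        expected = math.sqrt(d) / d       # P phi_lambda under the uniform law
        total += (empirical - expected) ** 2
    return math.sqrt(total)


def chi_threshold(d, n, x, eps):
    """Right-hand level of (55) with B^2 = d - 1, V <= 1 and H^2 = d."""
    kappa = 1 / eps + 1 / 3
    return ((1 + eps) * math.sqrt((d - 1) / n)
            + math.sqrt(2 * x / n)
            + kappa * math.sqrt(d) * x / n)


if __name__ == "__main__":
    print(chi_statistic([0.1, 0.2], 2), chi_threshold(2, 100, 1.0, 1.0))
```

The leading term $(1+\varepsilon)\sqrt{(d-1)/n}$ is of the order of the square root of the variance term $D_m/n$ of a $d$-dimensional model, which is why this inequality yields penalties proportional to $D_m/n$.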
Proof :

Following Massart, Proposition 7.3, we remark that, by the Cauchy-Schwarz inequality,

$$\Big( \sum_{\lambda\in\Lambda} (\nu_n\phi_\lambda)^2 \Big)^{1/2} = \sup_{a\in S_\Lambda} \sum_{\lambda\in\Lambda} a_\lambda \nu_n(\phi_\lambda) = \sup_{a\in S_\Lambda} \nu_n\Big( \sum_{\lambda\in\Lambda} a_\lambda\phi_\lambda \Big).$$

Thus the result follows by applying Talagrand's Theorem to the class of functions

$$\mathcal{F} = \Big\{ t = \sum_{\lambda\in\Lambda} a_\lambda\phi_\lambda;\ a \in S_\Lambda \Big\}.$$